Bayesian Logistic Regression Capstone

On this page

  • Introduction
    • Aims
  • Method
    • Bayesian Logistic Regression
    • Model Structure
    • Bayesian Approach
      • Prior Specification
    • Advantages of Bayesian Logistic Regression
    • Posterior Predictions
    • Model Evaluation and Diagnostics
  • Analysis and Results
    • Statistical Tool
    • Data source
    • Data pre-processing
    • Data Variables
    • Exploratory Data Analysis (Adult, 20 - 80 years)
    • Study population (Adult, NHANES)
      • Population characteristics
    • Abnormalities detected in Adult analytic dataset
      • Missingness and MICE
  • Statistical Modeling
    • Summary results from Multiple Logistic Regression model using survey weights
    • Multivariate Imputation by Chained Equations
      • Results from MICE (pooled imputed dataset):
      • Summary of Imputed Adult Dataset (N = 5,592)
      • Visualization of imputed dataset
  • Bayesian Logistic Regression analysis on imputed dataset (adult_imp1)
    • Visualization of Priors and Data Distributions
  • Visualization from Posterior Draw for predictive checking in Bayesian statistics
    • Comparative Visualizations
      • Comparative Visualizations for Model Assessment
      • Predicted vs Observed BMI
      • Visualization on Prior vs Posterior Distributions
    • Comparison of Prior and Predicted draws for both Age and BMI
    • Proportion of diabetes in the posterior draws
    • Posterior predicted proportion of Diabetes vs NHANES prevalence of Diabetes in the population
      • Survey-Weighted Prevalence (8.9%)
      • Imputed (Unweighted) Prevalence (11.1%)
      • Posterior Predictive Mean (10.9%)
      • Summarizing
    • MCMC Autocorrelation for Key Parameters
  • Model Overview and Significant Predictors
  • Conclusion
  • Discussions
  • Limitations
  • QUESTION for targeted therapy
    • Translational Research Implications
    • Internal validation
    • Targeted BMI Analysis for Predicted Diabetes Risk
    • Implications
    • References

Bayesian Logistic Regression - Application in Probability Prediction of disease (Diabetes)

Capstone Project 2025

Author

Namita Mishra, Autumn Wilcox (Advisor: Dr. Ashraf Cohen)

Published

October 29, 2025

Slides: slides.html


Introduction

Diabetes mellitus (DM) is a major public health concern closely associated with factors such as obesity, age, race, and gender. Identifying these associated risk factors is essential for targeted interventions D’Angelo and Ran (2025). Traditional logistic regression, which estimates the association between risk factors and outcomes, is insufficient for analyzing complex healthcare data such as DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments Zeger et al. (2020). Classical maximum likelihood estimation (MLE) yields unstable results in samples that are small, have missing data, or present quasi- or complete separation.

Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) allow analysis of multivariate longitudinal healthcare data with repeated measures within individuals and individuals nested in a population. By integrating prior knowledge and including exogenous (e.g., age, clinical history) and endogenous (e.g., current treatment) covariates, Bayesian models provide posterior distributions and risk predictions for conditions such as pneumonia, prostate cancer, and mental disorders. Parametric assumptions remain a limitation of these models.

In Bayesian inference, Chatzimichail and Hatjimihail (2023) have shown that parametric models (with fixed parameters) often underperform compared to nonparametric models, which do not assume a fixed distributional form. Posterior probabilities from Bayesian approaches improve disease classification and better capture heterogeneity in skewed, bimodal, or multimodal data distributions. Bayesian nonparametric models are flexible and robust, integrating multiple diagnostic tests and priors to enhance accuracy and precision, though reliance on prior information and restricted access to resources can limit applicability. Combining Bayesian methods with other statistical or computational approaches helps address systemic biases, incomplete data, and non-representative datasets.

The Bayesian framework described by R. van de Schoot et al. (2021) highlights the role of priors, data modeling, inference, posterior sampling, variational inference, and variable selection. Proper variable selection mitigates multicollinearity, overfitting, and limited sampling, improving predictive performance. Priors can be informative, weakly informative, or diffuse, and can be elicited from expert opinion, generic knowledge, or data-driven methods. Sensitivity analysis evaluates the alignment of priors with likelihoods, while MCMC simulations (e.g., brms, blavaan in R) empirically estimate posterior distributions. Spatial and temporal Bayesian models have applications in large-scale cancer genomics, identifying molecular interactions, mutational signatures, patient stratification, and cancer evolution, though temporal autocorrelation and subjective prior elicitation can be limiting.

Bayesian normal linear regression has been applied in metrology for instrument calibration using conjugate Normal–Inverse-Gamma priors Klauenberg et al. (2015). Hierarchical priors add flexibility by modeling uncertainty across multiple levels, improving robustness and interpretability. Bayesian hierarchical/meta-analytic linear regression incorporates both exchangeable and unexchangeable prior information, addressing multiple testing challenges, small sample sizes, and complex relationships among regression parameters across studies Leeuw and Klugkist (2012).

Sequential clinical reasoning models Liu et al. (2013) demonstrate screening by adding predictors stepwise: (1) demographics, (2) metabolic components, and (3) conventional risk factors, incorporating priors and mimicking clinical evaluation. This approach captures ecological heterogeneity and improves baseline risk estimation, though interactions between predictors and the lack of external cross-validation remain limitations.

Bayesian multiple imputation with logistic regression addresses missing data in clinical research Austin et al. (2021) by classifying missing values (e.g., patient refusal, loss to follow-up, mechanical errors) as MAR, MNAR, or MCAR. Multiple imputation generates plausible values across datasets and pools results for reliable classification of patient health status and mortality.
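A minimal sketch of this impute-fit-pool workflow with the mice package in R; the data frame `dat` and variables `y`, `x1`, `x2` are placeholders, not objects from this study:

```r
library(mice)

# generate m = 5 completed datasets; predictive mean matching is the
# default for numeric columns (assumes missingness is MAR)
imp <- mice(dat, m = 5, seed = 123, printFlag = FALSE)

# fit the same logistic regression on each imputed dataset
fits <- with(imp, glm(y ~ x1 + x2, family = binomial()))

# pool coefficient estimates across imputations using Rubin's rules
summary(pool(fits))
```

Pooling combines within- and between-imputation variance, so the reported standard errors reflect the uncertainty added by the missing data.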

Aims

The present study performs Bayesian logistic regression to predict diabetes status and to evaluate associations between diabetes and its predictors: body mass index (BMI), age (≥20 years), gender, and race. The study analyzes a retrospective dataset (2013–2014 NHANES survey data) based on a complex sampling design, characterized by stratification, clustering, and oversampling of specific population subgroups rather than uniform random sampling. A Bayesian analytical approach addresses challenges posed by dataset anomalies, such as missing data, complete-case analysis, and separation, that limit the efficiency and reliability of traditional logistic regression in predicting health outcomes.

Method

Bayesian Logistic Regression

The study employs Bayesian logistic regression to estimate associations between predictors and outcome probabilities.
The Bayesian framework integrates prior information with observed data to generate posterior distributions, allowing direct probabilistic interpretation of parameters.
This approach provides flexibility in model specification, accounts for parameter uncertainty, and produces credible intervals that fully reflect uncertainty in the estimates.
Unlike traditional frequentist methods, the Bayesian method enables inference through posterior probabilities, supporting more nuanced decision-making and interpretation.


Model Structure

  • Bayesian logistic regression is a probabilistic modeling framework used to estimate the relationship between one or more predictors (continuous or categorical) and a binary outcome (e.g., presence/absence of disease).

  • It extends classical logistic regression by combining it with Bayesian inference, treating model parameters as random variables with probability distributions rather than fixed point estimates.

  • The logistic model relates the probability of an outcome ( Y = 1 ) to a linear combination of predictors through the logit link function:

    [ logit(P(Y = 1)) = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_k X_k ]

    For individual i with p predictors, this is

    [ logit(p_i) = log(p_i / (1 − p_i)) = β_0 + ∑_{j=1}^{p} β_j x_{ij} ]

    where:
    - p_i: the probability of the event (e.g., having diabetes) for individual i.
    - logit(p_i) = log(p_i / (1 − p_i)): the log-odds of the event.
    - β_0: the intercept, the log-odds of the event when all predictors x_{ij} = 0.
    - β_j: the coefficient for predictor x_j, representing the change in log-odds for a one-unit increase in x_{ij}, holding other variables constant.
    - ∑_{j=1}^{p} β_j x_{ij}: the combined linear effect of all predictors.

  • In the Bayesian framework, the coefficients (β) are assigned prior distributions, which are updated in light of the observed data to yield posterior distributions.
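As a concrete illustration of the logit link, the sketch below converts a linear predictor back to a probability; the coefficient values are invented for illustration and are not estimates from this study:

```r
# inverse-logit maps log-odds (any real number) to a probability in (0, 1)
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# hypothetical coefficients: intercept -2.0, BMI slope 0.08 per kg/m^2
beta0    <- -2.0
beta_bmi <-  0.08

eta <- beta0 + beta_bmi * 30  # linear predictor (log-odds) at BMI = 30
inv_logit(eta)                # predicted probability, about 0.60
```

A one-unit increase in BMI adds 0.08 to the log-odds, i.e., multiplies the odds by exp(0.08) ≈ 1.083.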


Bayesian Approach

  • The Bayesian approach naturally incorporates uncertainty in all model parameters.

  • It combines prior beliefs with observed data to produce posterior distributions according to Bayes’ theorem:

    [ P(β | data) ∝ P(data | β) × P(β) ]   (Posterior ∝ Likelihood × Prior)

  • Likelihood: Represents the probability of the observed data given the model parameters (as in classical logistic regression).

  • Prior: Encodes prior knowledge or beliefs about parameter values before observing the data.

  • Posterior: Represents updated beliefs about parameters after observing the data.

Prior Specification

A weakly informative Student’s t-distribution prior, student_t(3, 0, 10), was used for regression coefficients (van de Schoot et al., 2013).
This prior:
- Has 3 degrees of freedom (ν = 3), producing heavy tails that allow for occasional large effects.
- Is centered at 0 (μ = 0), reflecting no initial bias toward positive or negative associations.
- Has a scale parameter of 10 (σ = 10), allowing broad variation in possible coefficient values.
Such priors improve stability in models with small sample sizes, high collinearity, or potential outliers.
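In brms (mentioned earlier as an MCMC tool), this prior can be declared as follows; the model formula is a sketch based on the study's variables, not the exact fitted model:

```r
library(brms)

# weakly informative Student-t prior for all regression slopes ("b" class)
bprior <- prior(student_t(3, 0, 10), class = "b")

# sketch of the model call (commented out; MCMC sampling is expensive)
# fit <- brm(diabetes_dx ~ age_c + bmi_c + sex + race,
#            family = bernoulli(link = "logit"),
#            prior  = bprior,
#            data   = adult,
#            chains = 4, iter = 2000, seed = 123)
```

Declaring the prior once for the "b" class applies it to every slope; the intercept keeps its own default prior unless set separately.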


Advantages of Bayesian Logistic Regression

  • Uncertainty quantification: Produces full posterior distributions instead of single-point estimates.
  • Credible intervals: Provide the range within which a parameter lies with a specified probability (e.g., 95%).
  • Flexible priors: Allow incorporation of expert knowledge or results from previous studies.
  • Probabilistic predictions: Posterior predictive distributions yield direct probabilities for future observations.
  • Comprehensive model checking: Posterior predictive checks (PPCs) evaluate how well simulated outcomes reproduce observed data.

Posterior Predictions

Posterior distributions of the coefficients are used to estimate the probability of the outcome for given predictor values.
This enables statements such as:
> “Given the predictors, the probability of the outcome lies between X% and Y%.”

Posterior predictions incorporate two sources of uncertainty:
- Parameter uncertainty: Variability in estimated model coefficients.
- Predictive uncertainty: Variability in future outcomes given those parameters.
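These two uncertainty sources map onto two extraction functions in brms; a sketch assuming a fitted brms model object `fit` and a one-row data frame `new_obs` of predictor values (both placeholders):

```r
# parameter uncertainty only: one value of P(Y = 1 | x) per posterior draw
p_draws <- posterior_epred(fit, newdata = new_obs)

# parameter + predictive uncertainty: simulated 0/1 outcomes per draw
y_draws <- posterior_predict(fit, newdata = new_obs)

# a 95% credible interval for the predicted probability
quantile(p_draws, probs = c(0.025, 0.975))
```

The quantile call supports statements of the form "the probability of the outcome lies between X% and Y%" quoted above.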


Model Evaluation and Diagnostics

Model quality and convergence were assessed using standard Bayesian diagnostics:

  • Convergence diagnostics: Markov Chain Monte Carlo (MCMC) performance was evaluated using the potential scale reduction factor R̂ (R-hat) and effective sample size (ESS).
  • Autocorrelation checks: Assessed dependence between successive MCMC draws to confirm adequate mixing.
  • Posterior predictive checks (PPC): Compared simulated data from posterior distributions to observed outcomes.
  • Bayesian R²: Quantified the proportion of variance explained by the predictors, incorporating uncertainty.

In Bayesian analysis, every unknown parameter — such as a regression coefficient, mean, or variance — is treated as a random variable with a probability distribution that expresses uncertainty given the observed data.
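The diagnostics listed above are available from standard R tooling; a sketch assuming a fitted brms model object `fit` (a placeholder):

```r
# R-hat and effective sample sizes: brms prints Rhat, Bulk_ESS, and
# Tail_ESS for each coefficient in the model summary
summary(fit)

# autocorrelation of successive draws for selected parameters
bayesplot::mcmc_acf(as.array(fit), pars = c("b_Intercept", "b_bmi_c"))

# posterior predictive check: simulated vs observed outcome distribution
brms::pp_check(fit, ndraws = 100)

# Bayesian R-squared, reported with posterior uncertainty
brms::bayes_R2(fit)
```

As rules of thumb, R̂ close to 1.00 and ESS in the hundreds or more per parameter indicate adequate convergence and mixing.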

Analysis and Results

Statistical Tool

R packages and libraries were used to import the data and to perform data wrangling and analysis.

Data source

  • NHANES 2-year (2013–2014) cross-sectional weighted data Center for Health Statistics (1999) were imported into R

Data pre-processing

Adult dataset: Three NHANES datasets (demographics, exam, questionnaire) in .XPT format were imported into R using the haven package. Variables of interest were selected from the original weighted datasets and merged on participant ID to create a single adult analytic dataframe.

Data Variables

Response Variable
- diabetes_dx: binary Type 2 / diagnosed diabetes (excluding gestational diabetes), created by combining:
  - DIQ010: Doctor told you have diabetes
  - DIQ050: excluded (a secondary variable describing treatment status (insulin use))

Predictor Variable
- Body Mass Index (continuous; categorized for description)

Covariates
- Gender (factor, 2 levels)
- Ethnicity (factor, 5 levels)
- Age (continuous, 20–80 years)

Code
library("nhanesA")

# list available 2013–2014 examination tables
nhanesTables('EXAM', 2013)
   Data.File.Name
1           BPX_H
2        DXXSPN_H
3        DXXFEM_H
4        DXXVFA_H
5           BMX_H
6        DXXFRX_H
7           CSX_H
8           MGX_H
9        OHXDEN_H
10       OHXPER_H
11       OHXREF_H
12       FLXCLN_H
13       DXXAAC_H
14        DXXT4_H
15        DXXT5_H
16        DXXT6_H
17        DXXT7_H
18        DXXT8_H
19        DXXT9_H
20       DXXT10_H
21       DXXT11_H
22       DXXT12_H
23        DXXL1_H
24        DXXL2_H
25        DXXL3_H
26        DXXL4_H
27          DXX_H
28       PAXDAY_H
29        PAXHD_H
30        PAXHR_H
31       PAXMIN_H
32        DXXAG_H
                                               Data.File.Description
1                                                     Blood Pressure
2                           Dual-Energy X-ray Absorptiometry - Spine
3                           Dual-Energy X-ray Absorptiometry - Femur
4   Dual-Energy X-ray Absorptiometry - Vertebral Fracture Assessment
5                                                      Body Measures
6                      Dual-Energy X-ray Absorptiometry - FRAX Score
7                                                      Taste & Smell
8                                        Muscle Strength - Grip Test
9                                            Oral Health - Dentition
10                                         Oral Health - Periodontal
11                              Oral Health - Recommendation of Care
12                                              Fluorosis - Clinical
13 Dual-Energy X-ray Absorptiometry - Abdominal Aortic Calcification
14        Dual-Energy X-ray Absorptiometry - T4 Vertebrae Morphology
15        Dual-Energy X-ray Absorptiometry - T5 Vertebrae Morphology
16        Dual-Energy X-ray Absorptiometry - T6 Vertebrae Morphology
17        Dual-Energy X-ray Absorptiometry - T7 Vertebrae Morphology
18        Dual-Energy X-ray Absorptiometry - T8 Vertebrae Morphology
19        Dual-Energy X-ray Absorptiometry - T9 Vertebrae Morphology
20       Dual-Energy X-ray Absorptiometry - T10 Vertebrae Morphology
21       Dual-Energy X-ray Absorptiometry - T11 Vertebrae Morphology
22       Dual-Energy X-ray Absorptiometry - T12 Vertebrae Morphology
23        Dual-Energy X-ray Absorptiometry - L1 Vertebrae Morphology
24        Dual-Energy X-ray Absorptiometry - L2 Vertebrae Morphology
25        Dual-Energy X-ray Absorptiometry - L3 Vertebrae Morphology
26        Dual-Energy X-ray Absorptiometry - L4 Vertebrae Morphology
27                     Dual-Energy X-ray Absorptiometry - Whole Body
28                                   Physical Activity Monitor - Day
29                                Physical Activity Monitor - Header
30                                  Physical Activity Monitor - Hour
31                                Physical Activity Monitor - Minute
32    Dual-Energy X-ray Absorptiometry - Android/Gynoid Measurements
Code
                       nhanesTables('QUESTIONNAIRE', 2013)
   Data.File.Name                 Data.File.Description
1           DLQ_H                            Disability
2           DEQ_H                           Dermatology
3           OSQ_H                          Osteoporosis
4           IMQ_H                          Immunization
5           SXQ_H                       Sexual Behavior
6           CDQ_H                 Cardiovascular Health
7           BPQ_H          Blood Pressure & Cholesterol
8           MCQ_H                    Medical Conditions
9           HIQ_H                      Health Insurance
10          HUQ_H Hospital Utilization & Access to Care
11          PAQ_H                     Physical Activity
12          PFQ_H                  Physical Functioning
13          HEQ_H                             Hepatitis
14          ECQ_H                       Early Childhood
15          DIQ_H                              Diabetes
16       SMQFAM_H           Smoking - Household Smokers
17          SMQ_H               Smoking - Cigarette Use
18       SMQRTU_H          Smoking - Recent Tobacco Use
19          HOQ_H               Housing Characteristics
20       PUQMEC_H                         Pesticide Use
21       SMQSHS_H   Smoking - Secondhand Smoke Exposure
22          INQ_H                                Income
23          CSQ_H                         Taste & Smell
24          DBQ_H             Diet Behavior & Nutrition
25          CBQ_H                     Consumer Behavior
26          HSQ_H                 Current Health Status
27          SLQ_H                       Sleep Disorders
28       RXQASA_H                Preventive Aspirin Use
29          DUQ_H                              Drug Use
30       WHQMEC_H                Weight History - Youth
31          ALQ_H                           Alcohol Use
32          DPQ_H   Mental Health - Depression Screener
33          ACQ_H                         Acculturation
34          WHQ_H                        Weight History
35          RHQ_H                   Reproductive Health
36          FSQ_H                         Food Security
37          OHQ_H                           Oral Health
38          OCQ_H                            Occupation
39       RXQ_RX_H              Prescription Medications
40        KIQ_U_H           Kidney Conditions - Urology
41          CKQ_H                       Creatine Kinase
42          VTQ_H                     Volatile Toxicant
43          CFQ_H                 Cognitive Functioning
Code
                       nhanesTables('DEMOGRAPHICS', 2013)
  Data.File.Name                    Data.File.Description
1         DEMO_H Demographic Variables and Sample Weights
Code
# codebook for variable details

nhanesCodebook("DEMO_H",'RIDRETH1')
$`Variable Name:`
[1] "RIDRETH1"

$`SAS Label:`
[1] "Race/Hispanic origin"

$`English Text:`
[1] "Recode of reported race and Hispanic origin information"

$`Target:`
[1] "Both males and females 0 YEARS -\r 150 YEARS"

$RIDRETH1
# A tibble: 6 × 5
  `Code or Value` `Value Description`            Count Cumulative `Skip to Item`
  <chr>           <chr>                          <int>      <int> <lgl>         
1 1               Mexican American                1730       1730 NA            
2 2               Other Hispanic                   960       2690 NA            
3 3               Non-Hispanic White              3674       6364 NA            
4 4               Non-Hispanic Black              2267       8631 NA            
5 5               Other Race - Including Multi-…  1544      10175 NA            
6 .               Missing                            0      10175 NA            
Code
nhanesCodebook("DEMO_H",'RIAGENDR')
$`Variable Name:`
[1] "RIAGENDR"

$`SAS Label:`
[1] "Gender"

$`English Text:`
[1] "Gender of the participant."

$`Target:`
[1] "Both males and females 0 YEARS -\r 150 YEARS"

$RIAGENDR
# A tibble: 3 × 5
  `Code or Value` `Value Description` Count Cumulative `Skip to Item`
  <chr>           <chr>               <int>      <int> <lgl>         
1 1               Male                 5003       5003 NA            
2 2               Female               5172      10175 NA            
3 .               Missing                 0      10175 NA            
Code
nhanesCodebook("DEMO_H",'RIDAGEYR')
$`Variable Name:`
[1] "RIDAGEYR"

$`SAS Label:`
[1] "Age in years at screening"

$`English Text:`
[1] "Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age."

$`Target:`
[1] "Both males and females 0 YEARS -\r 150 YEARS"

$RIDAGEYR
# A tibble: 3 × 5
  `Code or Value` `Value Description`      Count Cumulative `Skip to Item`
  <chr>           <chr>                    <int>      <int> <lgl>         
1 0 to 79         Range of Values           9823       9823 NA            
2 80              80 years of age and over   352      10175 NA            
3 .               Missing                      0      10175 NA            
Code
# Note: nhanesCodebook() looks up one variable at a time; passing
# "DIQ010, DIQ050" as a single string matches no variable, hence the
# empty list below. Query each separately, e.g. nhanesCodebook("DIQ_H", "DIQ010").
nhanesCodebook("DIQ_H","DIQ010, DIQ050")
named list()
Code
nhanesCodebook("BMX_H",'BMXBMI')
$`Variable Name:`
[1] "BMXBMI"

$`SAS Label:`
[1] "Body Mass Index (kg/m**2)"

$`English Text:`
[1] "Body Mass Index (kg/m**2)"

$`Target:`
[1] "Both males and females 2 YEARS -\r 150 YEARS"

$BMXBMI
# A tibble: 2 × 5
  `Code or Value` `Value Description` Count Cumulative `Skip to Item`
  <chr>           <chr>               <int>      <int> <lgl>         
1 12.1 to 82.9    Range of Values      9055       9055 NA            
2 .               Missing               758       9813 NA            
Code
  #  .xpt files read ( 2013–2014)                      
                      bmx_h <- nhanes("BMX_H")         #Exam
                      demo_h <- nhanes("DEMO_H")       #Demo
                      diq_h <- nhanes("DIQ_H")         #diabetes

# variables of interest

library(dplyr)

exam_sub <- bmx_h %>% 
  select(SEQN, BMXBMI) %>%
  rename(
    ID = SEQN,
    BMI = BMXBMI
  )

need_demo <- c("SEQN","RIDAGEYR","RIAGENDR","RIDRETH1","SDMVPSU","SDMVSTRA","WTMEC2YR")
stopifnot(all(c("SEQN","BMXBMI") %in% names(bmx_h)))
stopifnot(all(need_demo %in% names(demo_h)))
if (!("DIQ010" %in% names(diq_h))) {
  stop("DIQ010 is not in DIQ_H. Check the cycle name 'DIQ_H' and nhanesA version.")
}

 #  Select only needed variables
exam_sub <- bmx_h  %>% select(SEQN, BMXBMI)
demo_sub <- demo_h %>% select(all_of(need_demo))
diq_sub  <- diq_h  %>% select(SEQN, DIQ010, dplyr::any_of("DIQ050"))

# merged dataframe

merged_data <- demo_sub %>%
  left_join(exam_sub, by = "SEQN") %>%
  left_join(diq_sub,  by = "SEQN")

names(merged_data)
 [1] "SEQN"     "RIDAGEYR" "RIAGENDR" "RIDRETH1" "SDMVPSU"  "SDMVSTRA"
 [7] "WTMEC2YR" "BMXBMI"   "DIQ010"   "DIQ050"  
Code
saveRDS(merged_data, "data/nhanes2013_2014_prepared.rds")
Variable Descriptions: Adult Dataset
Variable Description Type
diabetes_dx Diabetes diagnosis (1 = Yes, 0 = No) based on medical questionnaire. Categorical
age Age of participant in years. Continuous
bmi Body Mass Index (BMI) in kilograms per square meter (kg/m²), calculated from measured height and weight. Continuous
sex Sex of participant (Male or Female). Categorical
race Race/Ethnicity (e.g., Non-Hispanic White, Non-Hispanic Black, Mexican American, etc.). Categorical
WTMEC2YR Examination sample weight for MEC (Mobile Examination Center) participants. Weight
SDMVPSU Primary Sampling Unit (PSU) used for variance estimation in complex survey design. Design
SDMVSTRA Stratum variable used to define strata for complex survey design. Design
age_c Age variable centered and standardized (z-score). Continuous
bmi_c BMI variable centered and standardized (z-score). Continuous
wt_norm Normalized survey weight (WTMEC2YR divided by its mean, for model weighting). Weight

Exploratory Data Analysis (Adult, 20 - 80 years)

  • Discrete vs Continuous Columns: 25% of columns are discrete (categorical) while 75% are continuous, indicating that the dataset primarily contains continuous measurements such as age and BMI.
  • Complete Rows: 92.7% of rows have complete information for all variables, meaning most participants have fully observed data across predictors and outcomes.

Survey design:
- NHANES is a national survey based on a complex sampling design, oversampling certain groups (e.g., minorities, older adults) to ensure representation.
- It uses multistage sampling to represent the U.S. population, so we apply sampling weights, strata, and PSUs (primary sampling units) for valid estimates.
- We use the survey design in regression analysis to avoid biased prevalence estimates (e.g., mean BMI or diabetes %), underestimated standard errors, and incorrect inference for population-level parameters.
- Auxiliary design variables: SDMVPSU, SDMVSTRA, WTMEC2YR.
- Diabetes grouped from DIQ010 (excluding DIQ050): diabetes_dx (numeric 0/1).
- Covariates: ethnicity (5 levels), age range (20–80 years), gender (male and female), BMI as continuous.
- Centered covariates: age_c, bmi_c; BMI categories: bmi_cat.
- Presented here are the mean, standard error, and variance of the survey-weighted data.

Step Description
Weighting Used the survey package to calculate weighted means and standard deviations for all variables.
Standardization Standardized BMI and age variables for analysis.
Age Categorization Recoded into intervals: 20–<30, 30–<40, 40–<50, 50–<60, 60–<70, and 70–80 years.
BMI Categorization Recoded and categorized as: <18.5 (Underweight), 18.5–<25 (Normal), 25–<30 (Overweight), 30–<35 (Obesity I), 35–<40 (Obesity II), ≥40 (Obesity III).
Ethnicity Recoding Recoded as: 1 = Mexican American, 2 = Other Hispanic, 3 = Non-Hispanic White, 4 = Non-Hispanic Black, 5 = Other/Multi.
Special Codes Special codes (e.g., 3, 7, 9) were recoded to NA. These codes are not random and could introduce bias if ignored (MAR or MNAR).
Missing Data Missing values were retained and visualized to assess their pattern and informativeness.
Final Dataset Created a cleaned analytic dataset (adult) using Non-Hispanic White and Male as reference groups for analysis.
Code
## 
# ---------------- Basic Exploration (adults) ----------------

# Keep adults only and build analysis variables
adult <- merged_data %>%
  dplyr::filter(RIDAGEYR >= 20) %>%
  dplyr::transmute(
    # --- keep survey design variables so svydesign() can see them ---
    SDMVPSU, SDMVSTRA, WTMEC2YR,

    # --- outcome: DIQ010 (1 yes, 2 no; 3/7/9 -> NA) ---
    diabetes_dx = dplyr::case_when(
      DIQ010 == 1 ~ 1,
      DIQ010 == 2 ~ 0,
      DIQ010 %in% c(3, 7, 9) ~ NA_real_,
      TRUE ~ NA_real_
    ),

    # --- predictors (raw) ---
    bmi  = BMXBMI,
    age  = RIDAGEYR,

    # sex (1=Male, 2=Female)
    sex  = forcats::fct_recode(factor(RIAGENDR), Male = "1", Female = "2"),

    # race (5-level)
    race = forcats::fct_recode(
      factor(RIDRETH1),
      "Mexican American" = "1",
      "Other Hispanic"   = "2",
      "NH White"         = "3",
      "NH Black"         = "4",
      "Other/Multi"      = "5"
    ),

    # keep DIQ050 so we can safely reference it (may be absent/NA in some rows)
    
    DIQ050 = DIQ050
  ) %>%
  # standardize continuous predictors
  dplyr::mutate(
    age_c = as.numeric(scale(age)),
    bmi_c = as.numeric(scale(bmi)),
    bmi_cat = cut(
      bmi,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("<18.5","18.5–<25","25–<30","30–<35","35–<40","≥40"),
      right = FALSE
    )
  ) %>%
  # adjust outcome: if female & DIQ050==1 ("only when pregnant"), set to 0 (not diabetes)
  dplyr::mutate(
    diabetes_dx = ifelse(sex == "Female" & !is.na(DIQ050) & DIQ050 == 1, 0, diabetes_dx)
  )

# Make NH White the reference level for race (clearer interpretation)
adult <- adult %>%
  dplyr::mutate(
    race = forcats::fct_relevel(race, "NH White")
  )

# --- sanity checks ---
cat("Adults n =", nrow(adult), "\n")
Adults n = 5769 
Code
# survey design
# ---------------- Survey Design ----------------
# Use exam weights because BMI (BMXBMI) is an MEC variable

nhanes_design_adult <- survey::svydesign(
  id = ~SDMVPSU,
  strata = ~SDMVSTRA,
  weights = ~WTMEC2YR,
  nest = TRUE,
  data = adult
)

# quick weighted checks
survey::svymean(~age, nhanes_design_adult, na.rm = TRUE)
      mean     SE
age 47.496 0.3805
Code
survey::svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
                mean     SE
diabetes_dx 0.089016 0.0048
Code
# Calculate effective sample size for diabetes

# Variance ignoring survey design (i.e., assuming SRS)
v <- svyvar(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
p <- mean(adult$diabetes_dx, na.rm = TRUE)
v_srs <- p * (1 - p) / nrow(adult)

# Design effect = actual variance / SRS variance
deff <- v / v_srs
deff  # design effect
            variance     SE
diabetes_dx   4759.9 0.0039
Code
n_total <- sum(weights(nhanes_design_adult))
ess <- n_total / deff
cat("Effective sample size for diabetes_dx:", round(ess), "\n")
Effective sample size for diabetes_dx: 48142 
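As a cross-check (a sketch using the same design object created above), svymean() can report the design effect directly, and an effective sample size relative to the observed respondents follows as n / DEFF:

```r
# design effect straight from svymean (design-based variance of the
# estimator divided by its variance under simple random sampling)
est <- survey::svymean(~diabetes_dx, nhanes_design_adult,
                       na.rm = TRUE, deff = TRUE)
est

# effective sample size based on observed (non-missing) respondents
n_obs <- sum(!is.na(adult$diabetes_dx))
n_obs / survey::deff(est)
```

This version keeps the numerator on the scale of the observed sample rather than the sum of the population weights.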
Code
library(dplyr)
library(skimr)
library(knitr)
library(tidyr)
library(purrr)
library(forcats)
library(kableExtra)

str(adult)
'data.frame':   5769 obs. of  12 variables:
 $ SDMVPSU    : num  1 1 1 2 1 1 2 1 2 2 ...
 $ SDMVSTRA   : num  112 108 109 116 111 114 106 112 112 113 ...
 $ WTMEC2YR   : num  13481 24472 57193 65542 25345 ...
 $ diabetes_dx: num  1 1 1 0 0 0 0 0 0 0 ...
 $ bmi        : num  26.7 28.6 28.9 19.7 41.7 35.7 NA 26.5 22 20.3 ...
 $ age        : num  69 54 72 73 56 61 42 56 65 26 ...
 $ sex        : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 2 1 2 1 2 ...
 $ race       : Factor w/ 5 levels "NH White","Mexican American",..: 4 1 1 1 2 1 3 1 1 1 ...
 $ DIQ050     : num  1 1 1 2 2 2 2 2 2 2 ...
 $ age_c      : num  1.132 0.278 1.303 1.36 0.392 ...
 $ bmi_c      : num  -0.3359 -0.0703 -0.0283 -1.3144 1.761 ...
 $ bmi_cat    : Factor w/ 6 levels "<18.5","18.5–<25",..: 3 3 3 2 6 5 NA 3 2 2 ...
Code
library(DataExplorer)  # provides plot_str(), plot_intro(), plot_missing()
plot_str(adult)
head(adult)
  SDMVPSU SDMVSTRA WTMEC2YR diabetes_dx  bmi age    sex             race DIQ050
1       1      112 13481.04           1 26.7  69   Male         NH Black      1
2       1      108 24471.77           1 28.6  54   Male         NH White      1
3       1      109 57193.29           1 28.9  72   Male         NH White      1
4       2      116 65541.87           0 19.7  73 Female         NH White      2
5       1      111 25344.99           0 41.7  56   Male Mexican American      2
6       1      114 61758.65           0 35.7  61 Female         NH White      2
      age_c       bmi_c  bmi_cat
1 1.1324183 -0.33588609   25–<30
2 0.2783598 -0.07028101   25–<30
3 1.3032300 -0.02834336   25–<30
4 1.3601672 -1.31443114 18.5–<25
5 0.3922343  1.76099614      ≥40
6 0.6769204  0.92224325   35–<40
Code
plot_intro(adult, title="Figure 1 (Adult dataset). Structure of variables and missing observations.")

Code
plot_missing(adult, title="Figure 2(Adult dataset). Breakdown of missing observations.")

Study population (Adult, NHANES)

  • Number of participants in the adult dataset (n = 5,769) after cleaning and variable selection
  • Weighted mean (SE): age 47.496 (0.3805); diabetes_dx 0.089016 (0.0048). Variance (SE) for diabetes_dx: 4759.9 (0.0039)

Effective sample size for diabetes_dx: 48142

Population characteristics

  • Age: Participants are fairly evenly distributed across adult age groups, with no sharp skewness.
  • Sex: The sample includes a higher proportion of females than males.
  • BMI: Most participants have BMI values within the normal to overweight range, with fewer in the obese category.
  • BMI by Diabetes Status: Individuals diagnosed with diabetes tend to have higher BMI values compared to non-diabetics.
  • Diabetes Prevalence by Age Group: The proportion of diabetes increases with advancing age, highlighting age as a strong risk factor.
  • Diabetes Prevalence by Race/Ethnicity: Differences are observed across racial/ethnic groups, with some showing higher prevalence rates than others.

Visualization - Below are a histogram, bar charts, and a boxplot displaying age, BMI, and their association with diabetes status. The final plot shows males and females with and without diabetes (including missing data) across racial groups; bars are placed side by side for each sex, with counts displayed on top.

Code
ggplot(adult, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white") +
  labs(
    title = "Distribution of Age >20 years",
    x = "Age (years)",
    y = "Count"
  ) +
  theme_minimal()

Code
ggplot(adult, aes(factor(diabetes_dx))) +
  geom_bar(fill = "steelblue") +
  labs(title="Diabetes Outcome Distribution in >20 years age group", x="diabetes_dx (0=No, 1=Yes)", y="Count")

Code
ggplot(adult, aes(factor(bmi_cat))) +
  geom_bar(fill = "steelblue") +
  labs(title="BMI Category Distribution in >20 years age group", x="bmi_cat", y="Count")

Code
ggplot(adult, aes(x = factor(diabetes_dx), y = bmi)) +
  geom_boxplot(fill = "skyblue") +
  labs(
    title = "BMI Distribution by Diabetes Diagnosis in >20 years age group",
    x = "Diabetes Diagnosis (0 = No, 1 = Yes)",
    y = "BMI"
  ) +
  theme_minimal()

Code
# plots for adult data bmi categories and race categories

ggplot(adult, aes(x = factor(race), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Diagnosis by Race in >20 years age group",
    x = "Race/Ethnicity",
    y = "Count",
    fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
ggplot(adult, aes(x = factor(bmi_cat), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Diagnosis by BMI in >20 years age group",
    x = "BMI",
    y = "Count",
    fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
# Example: create your dataset
adult1 <- data.frame(
  race = rep(c("NH White","Mexican American","Other Hispanic","NH Black","Other/Multi"), each = 6),
  sex = rep(c("Male","Male","Male","Female","Female","Female"), times = 5),
  diabetes_dx = rep(c(0,1,NA,0,1,NA), times = 5),
  count = c(
    1019,119,38,1164,96,36,
    304,60,14,329,49,11,
    183,26,10,255,25,9,
    461,100,19,515,65,17,
    351,46,8,393,32,15
  )
)

# Clean NA for plotting or convert to "Missing"
adult1$diabetes_dx <- as.character(adult1$diabetes_dx)
adult1$diabetes_dx[is.na(adult1$diabetes_dx)] <- "Missing"

# Plot grouped bar chart with counts displayed on top of each bar
ggplot(adult1, aes(x = diabetes_dx, y = count, fill = sex)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = count), position = position_dodge(width = 0.9),
            vjust = -0.3, size = 2.5) +
  facet_wrap(~race) +
  labs(title = "Diabetes Diagnosis by Sex and Race",
       x = "Diabetes Diagnosis",
       y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "orange"))

Abnormalities detected in Adult analytic dataset

Missingness and MICE

  • Only 1.3% of individual data points are missing, reflecting minimal missingness.
  • No column is entirely missing (0%), indicating all variables have at least some observed data.
  • Overall missingness: ~4% → low, but non-trivial given the small number of variables involved.
  • Missingness may not be completely at random (MAR or MNAR): if the probability of missingness depends on other observed variables (e.g., older adults missing BMI due to illness), imputation helps reduce bias. MICE should therefore be considered, and the assumption probed with a logistic regression of missingness indicators.
  • Missingness affects the outcome or key covariates: even small amounts of missingness in important variables can bias posterior estimates. Since BMI and diabetes are central, we performed MICE.
  • Sufficient auxiliary variables are available: MICE works best when other correlated variables inform the imputation (e.g., age, sex, race, WTMEC2YR).
  • The Bayesian model assumes complete data: standard Bayesian logistic models (e.g., brms, rstanarm) cannot directly handle NAs, so missing values must be imputed or the missingness explicitly modeled.
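The MAR check suggested above can be sketched with a logistic regression of a missingness indicator on observed covariates. A self-contained toy example (all data synthetic; variable names illustrative):

```r
# Sketch (synthetic data): regress "BMI is missing" on age to probe MAR.
set.seed(42)
n   <- 500
age <- runif(n, 20, 80)
bmi <- rnorm(n, 29, 7)
miss <- rbinom(n, 1, plogis(-4 + 0.05 * age))  # missingness rises with age (MAR)
bmi[miss == 1] <- NA

fit <- glm(is.na(bmi) ~ age, family = binomial())
coef(summary(fit))["age", ]   # a clearly positive slope flags MAR with respect to age
```

In the real dataset the same regression would include sex, race, and diabetes status as additional predictors of the missingness indicator.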

Statistical Modeling

Modeling Approach

  1. Survey-weighted complete-case modeling - To explore associations between predictors and diabetes status, we applied a frequentist multiple logistic regression model with survey weights to account for the complex sampling design of the dataset. This ensured population-representative estimates and valid standard errors.

  2. Multivariate Imputation by Chained Equations (MICE) - Missing data were addressed using MICE, an iterative regression-based approach that imputes each incomplete variable conditionally on the others. Multiple imputed datasets were generated and analyzed to reduce bias and increase statistical efficiency.

  3. Bayesian analysis - Following imputation, we conducted a Bayesian analysis to complement the frequentist findings, allowing incorporation of uncertainty and prior information in estimating model parameters and posterior predictive distributions.

Summary results from Multiple Logistic Regression model using survey weights

Survey-weighted odds ratios (per 1 SD)
term OR LCL UCL p.value
age_c 3.0292807 2.6967690 3.4027912 0.0000000
bmi_c 1.8853571 1.6526296 2.1508579 0.0000039
sexFemale 0.5281132 0.4104905 0.6794397 0.0003857
raceMexican American 2.0358434 1.4850041 2.7910081 0.0008262
raceOther Hispanic 1.5915182 1.1664529 2.1714810 0.0087119
raceNH Black 1.6689718 1.1605895 2.4000450 0.0116773
raceOther/Multi 2.3270527 1.5451752 3.5045697 0.0014331

Survey-Weighted Multiple Logistic Regression identified several significant predictors of diabetes diagnosis after adjusting for demographic and anthropometric factors:

  • Age (OR = 3.03 per 1 SD, 95% CI: 2.70–3.40, p < 0.001): Older adults had roughly three times higher odds of diabetes, indicating a strong positive association between age and diabetes risk.
  • BMI (OR = 1.89 per 1 SD, 95% CI: 1.65–2.15, p < 0.001): Higher body mass index was significantly associated with increased odds of diabetes, confirming obesity as a key risk factor.
  • Sex (Female vs. Male: OR = 0.53, 95% CI: 0.41–0.68, p < 0.001): Females had significantly lower odds of diabetes than males.
  • Race/Ethnicity (reference: Non-Hispanic White): Mexican American (OR = 2.04, 95% CI: 1.49–2.79, p < 0.001), Other Hispanic (OR = 1.59, 95% CI: 1.17–2.17, p = 0.009), Non-Hispanic Black (OR = 1.67, 95% CI: 1.16–2.40, p = 0.012), and Other/Multi-racial (OR = 2.33, 95% CI: 1.55–3.50, p = 0.001) participants all had significantly higher odds of diabetes.

Multivariate Imputation by Chained Equations

Multiple Imputation Pooled Logistic Regression

  • We conducted MICE to manage missing data as an alternative to a fully Bayesian treatment of missingness Buuren and Groothuis-Oudshoorn (2011)
  • Flatness of the density, heavy tails, non-zero peakedness, skewness and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples n > 400, even for high percentages (75%) of missing data in one variable Van Buuren and Van Buuren (2012).
  • Multiple Imputation (MI) can be performed using the mice package in R.
  • Iterative mice imputes missing values of one variable at a time, using regression models based on the other variables in the dataset.
  • In the chained process, each imputed variable becomes a predictor for the subsequent imputation, and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.

Below are the row and column counts from the three datasets created
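The chained-equations idea can be conveyed in a few lines of base R: regress the incomplete variable on the others, fill in predictions plus residual noise, and iterate. This toy version (synthetic data, one incomplete variable, no pooling) only sketches the mechanism that mice implements far more carefully:

```r
# Sketch (synthetic data): a bare-bones chained-equations style imputation.
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
y[sample(n, 30)] <- NA          # knock out 30 values of y

y_imp <- y
for (it in 1:5) {               # iterate until imputations stabilize
  fit  <- lm(y_imp ~ x)         # rows still NA are dropped on the first pass
  pred <- predict(fit, newdata = data.frame(x = x))
  # add residual noise so imputations are draws, not deterministic predictions
  y_imp[is.na(y)] <- pred[is.na(y)] + rnorm(sum(is.na(y)), 0, sigma(fit))
}
stopifnot(!anyNA(y_imp))
```

Repeating this whole procedure m times, each with fresh noise, yields the m complete datasets that are later pooled.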

  • Rows: 10175 and Columns: 10 (survey-weighted, merged data)
  • Rows: 5,769 and Columns: 12 (filtered data, adult)
  • Rows: 5,592 and Columns: 11 (imputed data, adult_imp1)
Code
# ----- Multiple Imputation (predictors only) 
mi_dat <- adult %>%
  dplyr::select(diabetes_dx, age, bmi, sex, race, WTMEC2YR, SDMVPSU, SDMVSTRA)

meth <- mice::make.method(mi_dat)
pred <- mice::make.predictorMatrix(mi_dat)

# Do not impute outcome
meth["diabetes_dx"] <- ""
pred["diabetes_dx", ] <- 0
pred[,"diabetes_dx"] <- 1

# Imputation methods
meth["age"]  <- "norm"
meth["bmi"]  <- "pmm"
meth["sex"]  <- "polyreg"
meth["race"] <- "polyreg"

# Survey design vars as auxiliaries only
meth[c("WTMEC2YR","SDMVPSU","SDMVSTRA")] <- ""
pred[, c("WTMEC2YR","SDMVPSU","SDMVSTRA")] <- 1

glimpse(mi_dat)
Rows: 5,769
Columns: 8
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age         <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ bmi         <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, NA, 26.5, 22.0, 20.3, …
$ sex         <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race        <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ WTMEC2YR    <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ SDMVPSU     <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA    <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
Code
imp <- mice::mice(mi_dat, m = 5, method = meth, predictorMatrix = pred, seed = 123)

 iter imp variable
  1   1  bmi
  1   2  bmi
  1   3  bmi
  1   4  bmi
  1   5  bmi
  2   1  bmi
  2   2  bmi
  2   3  bmi
  2   4  bmi
  2   5  bmi
  3   1  bmi
  3   2  bmi
  3   3  bmi
  3   4  bmi
  3   5  bmi
  4   1  bmi
  4   2  bmi
  4   3  bmi
  4   4  bmi
  4   5  bmi
  5   1  bmi
  5   2  bmi
  5   3  bmi
  5   4  bmi
  5   5  bmi

Results from MICE (pooled imputed dataset):

  • The final analytic dataset consisted of 5,769 participants and 8 variables; missing values were addressed using Multivariate Imputation by Chained Equations (MICE).
  • Five imputations were performed across five iterations each, with BMI imputed conditionally based on other predictors (age, sex, race, and diabetes status).
  • The iterative process showed stable convergence, indicating reliable estimation of missing BMI values for subsequent survey-weighted and Bayesian modeling analyses.
Code
fit_mi <- with(imp, {
  age_c <- as.numeric(scale(age))
  bmi_c <- as.numeric(scale(bmi))
  glm(diabetes_dx ~ age_c + bmi_c + sex + race, family = binomial())
})
pool_mi <- pool(fit_mi)
summary(pool_mi)
                  term   estimate  std.error  statistic       df       p.value
1          (Intercept) -2.6895645 0.09941301 -27.054453 5566.204 1.486581e-151
2                age_c  1.0660265 0.05594733  19.054108 5520.446  1.911564e-78
3                bmi_c  0.5468538 0.04473386  12.224604 5148.557  6.751227e-34
4            sexFemale -0.6178297 0.09379129  -6.587282 5551.660  4.892566e-11
5 raceMexican American  0.8877355 0.13750463   6.456041 5472.583  1.167455e-10
6   raceOther Hispanic  0.5606621 0.17485537   3.206433 5573.987  1.351505e-03
7         raceNH Black  0.6809629 0.11981185   5.683602 5576.734  1.385727e-08
8      raceOther/Multi  0.7476406 0.15300663   4.886328 4749.963  1.061140e-06
Code
## table 

mi_or <- summary(pool_mi, conf.int = TRUE, exponentiate = TRUE) %>%
  dplyr::rename(
    term = term, OR = estimate, LCL = `2.5 %`, UCL = `97.5 %`, p.value = p.value
  ) %>%
  dplyr::filter(term != "(Intercept)")
knitr::kable(mi_or, caption = "MI pooled odds ratios (per 1 SD)")
MI pooled odds ratios (per 1 SD)
term OR std.error statistic df p.value LCL UCL conf.low conf.high
2 age_c 2.9038183 0.0559473 19.054108 5520.446 0.0000000 2.6021752 3.2404277 2.6021752 3.2404277
3 bmi_c 1.7278084 0.0447339 12.224604 5148.557 0.0000000 1.5827382 1.8861754 1.5827382 1.8861754
4 sexFemale 0.5391132 0.0937913 -6.587282 5551.660 0.0000000 0.4485669 0.6479368 0.4485669 0.6479368
5 raceMexican American 2.4296216 0.1375046 6.456041 5472.583 0.0000000 1.8555327 3.1813298 1.8555327 3.1813298
6 raceOther Hispanic 1.7518320 0.1748554 3.206433 5573.987 0.0013515 1.2434346 2.4680953 1.2434346 2.4680953
7 raceNH Black 1.9757793 0.1198118 5.683602 5576.734 0.0000000 1.5621842 2.4988753 1.5621842 2.4988753
8 raceOther/Multi 2.1120110 0.1530066 4.886328 4749.963 0.0000011 1.5646727 2.8508138 1.5646727 2.8508138

After addressing missing BMI values using Multivariate Imputation by Chained Equations (MICE), pooled estimates from five imputations were analyzed using a survey-weighted multiple logistic regression model.

  • Age remained the strongest predictor of diabetes (OR = 2.90, 95% CI: 2.60–3.24, p < 0.001).
  • BMI continued to show a significant positive association (OR = 1.73, 95% CI: 1.58–1.89, p < 0.001).
  • Female sex was associated with lower odds of diabetes compared to males (OR = 0.54, 95% CI: 0.45–0.65, p < 0.001).
  • Compared to Non-Hispanic Whites, higher odds of diabetes were observed among: Mexican Americans (OR = 2.43, 95% CI: 1.86–3.18) Other Hispanics (OR = 1.75, 95% CI: 1.24–2.47) Non-Hispanic Blacks (OR = 1.98, 95% CI: 1.56–2.50) Other/Multi-racial groups (OR = 2.11, 95% CI: 1.56–2.85)
  • All associations were statistically significant (p < 0.01).
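The pooled estimates above follow Rubin's rules, which can be reproduced by hand. A sketch with illustrative (not actual) per-imputation numbers for a single coefficient:

```r
# Sketch: Rubin's rules for combining m = 5 imputed-data estimates.
# The numbers are illustrative, not taken from the model above.
est <- c(1.05, 1.08, 1.06, 1.07, 1.04)   # per-imputation log-odds estimates
se  <- c(0.056, 0.055, 0.057, 0.056, 0.055)

m     <- length(est)
q_bar <- mean(est)                # pooled point estimate
u_bar <- mean(se^2)               # within-imputation variance
b     <- var(est)                 # between-imputation variance
t_var <- u_bar + (1 + 1/m) * b    # total variance

c(estimate = q_bar, se = sqrt(t_var))
```

The pooled standard error always exceeds the average within-imputation one, reflecting the extra uncertainty contributed by the missing data.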
Code
library(gt)

# Bayesian Logistic Regression (formula weights) 
adult_imp1 <- complete(imp, 1) %>%
  dplyr::mutate(
    age_c  = as.numeric(scale(age)),
    bmi_c  = as.numeric(scale(bmi)),
    wt_norm = WTMEC2YR / mean(WTMEC2YR, na.rm = TRUE),
    # ensure factor refs match survey/mice:
    race = forcats::fct_relevel(race, "NH White"),
    sex  = forcats::fct_relevel(sex,  "Male")
  ) %>%
  dplyr::filter(!is.na(diabetes_dx), !is.na(age_c), !is.na(bmi_c),
                !is.na(sex), !is.na(race)) %>%
  droplevels()

stopifnot(all(is.finite(adult_imp1$wt_norm)))

glimpse(adult_imp1)
Rows: 5,592
Columns: 11
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age         <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ bmi         <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, 23.6, 26.5, 22.0, 20.3…
$ sex         <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race        <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ WTMEC2YR    <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ SDMVPSU     <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA    <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
$ age_c       <dbl> 1.13241831, 0.27835981, 1.30323001, 1.36016725, 0.39223428…
$ bmi_c       <dbl> -0.33319172, -0.06755778, -0.02561558, -1.31184309, 1.7639…
$ wt_norm     <dbl> 0.3393916, 0.6160884, 1.4398681, 1.6500477, 0.6380722, 1.5…
Code
library(tableone)

vars <- c("age", "bmi", "age_c", "bmi_c", "wt_norm", "sex", "race", "diabetes_dx")

table1 <- CreateTableOne(vars = vars, data = adult_imp1, factorVars = c("sex", "race", "diabetes_dx"))
print(table1, showAllLevels = TRUE)
                     
                      level            Overall      
  n                                     5592        
  age (mean (SD))                      48.84 (17.57)
  bmi (mean (SD))                      29.00 (7.11) 
  age_c (mean (SD))                    -0.02 (1.00) 
  bmi_c (mean (SD))                    -0.01 (0.99) 
  wt_norm (mean (SD))                   1.00 (0.79) 
  sex (%)             Male              2669 (47.7) 
                      Female            2923 (52.3) 
  race (%)            NH White          2398 (42.9) 
                      Mexican American   742 (13.3) 
                      Other Hispanic     489 ( 8.7) 
                      NH Black          1141 (20.4) 
                      Other/Multi        822 (14.7) 
  diabetes_dx (%)     0                 4974 (88.9) 
                      1                  618 (11.1) 

Summary of Imputed Adult Dataset (N = 5,592)

Code
## correlation matrix
library(ggplot2)
library(reshape2)

correlation_matrix <- cor(adult_imp1[, c("diabetes_dx", "age", "bmi")], use = "complete.obs", method = "pearson")
correlation_melted <- melt(correlation_matrix)

ggplot(correlation_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "Features", y = "Features")

Visualization of imputed dataset

Pairwise correlations

A heatmap of the variables diabetes_dx, age, and bmi shows the strength and direction of their pairwise Pearson correlations, which measure the linear association between variables.

Code
# Class distribution

ggplot(adult_imp1, aes(x = factor(diabetes_dx))) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Diabetes Diagnosis Distribution",
    x = "Diabetes Diagnosis (0 = No, 1 = Yes)",
    y = "Count"
  ) +
  theme_minimal()

Code
prop.table(table(adult_imp1$diabetes_dx))

       0        1 
0.889485 0.110515 
Code
# Visualization of Diabetes vs BMI (adult_data1)

library(ggplot2)

# Create the plot
ggplot(adult_imp1, aes(x = factor(diabetes_dx), y = bmi, fill = factor(diabetes_dx))) +
  geom_boxplot(alpha = 0.7) +
  scale_x_discrete(labels = c("0" = "No Diabetes", "1" = "Diabetes")) +
  labs(
    x = "Diabetes Diagnosis",
    y = "BMI",
    title = "BMI Distribution by Diabetes Status"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Code
# logistic regression curve
ggplot(adult_imp1, aes(x = bmi, y = diabetes_dx)) +
  geom_point(aes(y = diabetes_dx), alpha = 0.2, position = position_jitter(height = 0.02)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE, color = "blue") +
  labs(
    x = "BMI",
    y = "Probability of Diabetes",
    title = "Predicted Probability of Diabetes vs BMI"
  ) +
  theme_minimal()

Code
# Save your dataset as CSV
write.csv(adult_imp1, "adult_imp1.csv", row.names = FALSE)
  • Survey weights: normalized MEC exam weights (wt_norm) with mean 1.00 (SD 0.79)
  • Data readiness for Bayesian logistic regression:
    • No missing values remain in selected predictors or outcome.
    • Continuous variables are standardized, which facilitates prior specification.
    • Categorical variables are correctly re-leveled for reference categories.
    • Weights are available for inclusion in the likelihood to account for survey design.
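The standardization point in the readiness list can be verified mechanically: scale() output should have mean ≈ 0 and SD ≈ 1. A tiny sketch on synthetic values (illustrative only):

```r
# Sketch (synthetic data): confirm that scale() yields mean ~0 and SD ~1.
set.seed(9)
x     <- rnorm(300, 29, 7)       # hypothetical raw BMI values
x_std <- as.numeric(scale(x))
round(c(mean = mean(x_std), sd = sd(x_std)), 6)
```

The same check applied to age_c and bmi_c in adult_imp1 confirms the per-1-SD interpretation of the coefficients.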

Bayesian Logistic Regression analysis on imputed dataset (adult_imp1)

  1. Model Overview - A Bayesian logistic regression model was fitted on the first imputed dataset (adult_imp1) to assess predictors of diabetes diagnosis.
  • Continuous predictors (age and BMI) were standardized (per 1 SD), and survey weights were normalized (wt_norm) to maintain representativeness under the NHANES complex sampling design.
  • Categorical variables (sex and race) were factor-leveled using Male and Non-Hispanic White as reference groups.
  2. Prior Specification - priors were chosen to stabilize estimation in the presence of correlated predictors or outliers while retaining interpretability of model parameters.
  • Intercept prior: student_t(3, 0, 10), allowing heavy tails for flexibility in the intercept estimate. R. V. D. Schoot et al. (2013)
  • Regression coefficient prior: normal(0, 2.5), providing weakly informative regularization that constrains extreme values without overpowering the data. R. van de Schoot et al. (2021)
  3. Model Estimation
  • The model was estimated using four Markov chain Monte Carlo (MCMC) chains, each with 2,000 iterations (50% warm-up), and adapt_delta = 0.95 to ensure good chain convergence and reduce divergent transitions.
  • Posterior summaries represent the central tendency and uncertainty of the model parameters through credible intervals (CrI).
Code
library(gt)

priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),
  set_prior("student_t(3, 0, 10)", class = "Intercept") 
)

bayes_fit <- brm(
  formula = diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race,
  data    = adult_imp1,
  family  = bernoulli(link = "logit"),
  prior   = priors,
  chains  = 4, iter = 2000, seed = 123,
  control = list(adapt_delta = 0.95),
  refresh = 0   # quiet Stan output
)
Code
prior_summary(bayes_fit)
               prior     class                coef group resp dpar nlpar lb ub
      normal(0, 2.5)         b                                                
      normal(0, 2.5)         b               age_c                            
      normal(0, 2.5)         b               bmi_c                            
      normal(0, 2.5)         b raceMexicanAmerican                            
      normal(0, 2.5)         b         raceNHBlack                            
      normal(0, 2.5)         b     raceOtherDMulti                            
      normal(0, 2.5)         b   raceOtherHispanic                            
      normal(0, 2.5)         b           sexFemale                            
 student_t(3, 0, 10) Intercept                                                
 tag       source
             user
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
             user
Code
summary(bayes_fit)            # Bayesian model summary
 Family: bernoulli 
  Links: mu = logit 
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race 
   Data: adult_imp1 (Number of observations: 5592) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -2.66      0.09    -2.84    -2.50 1.00     4187     3510
age_c                   1.09      0.06     0.97     1.22 1.00     3012     3098
bmi_c                   0.63      0.05     0.53     0.72 1.00     3472     3315
sexFemale              -0.66      0.10    -0.86    -0.46 1.00     4003     3052
raceMexicanAmerican     0.69      0.18     0.35     1.04 1.00     3526     2843
raceOtherHispanic       0.43      0.24    -0.07     0.89 1.00     4058     3114
raceNHBlack             0.54      0.15     0.24     0.82 1.00     3597     3177
raceOtherDMulti         0.82      0.19     0.45     1.19 1.00     3763     3257

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Bayesian results: - Age and BMI were both positively associated with diabetes risk; higher standardized values corresponded to higher posterior odds of diabetes. - Sex: Females showed lower odds of diabetes compared to males. - Race/Ethnicity: All racial/ethnic groups demonstrated higher odds of diabetes compared to the Non-Hispanic White reference group, with the 95% CrI for Other Hispanic crossing zero on the log-odds scale.

Code
library(ggplot2)

# adult_imp1 plot 

# Convert to long format
adult_long <- adult_imp1 %>%
  select(bmi_c, age_c) %>%
  pivot_longer(cols = everything(), names_to = "Coefficient", values_to = "Value")

# Plot
ggplot(adult_long, aes(x = Value, fill = Coefficient)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Distributions for Coefficients from adult_imp1 data",
       x = "Coefficient Value", y = "Density") +
  scale_fill_manual(values = c("bmi_c" = "skyblue", "age_c" = "orange"))

Code
## prior draws 

prior_draws <- tibble(
  term = rep(c("Age (per 1 SD)", "BMI (per 1 SD)"), each = 4000),
  value = c(rnorm(4000, 0, 2.5), rnorm(4000, 0, 2.5))
)

## Plot (prior) (age and bmi) 
ggplot(prior_draws, aes(x = value, fill = term)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Prior Distributions for Coefficients",
       x = "Coefficient Value", y = "Density") +
  scale_fill_manual(values = c("skyblue", "orange"))

Visualization of Priors and Data Distributions

  1. Data-Derived Coefficient Distributions

The density plots of standardized BMI and Age from the imputed dataset (adult_imp1) show approximately normal distributions centered near zero, consistent with z-score standardization. This confirms that both predictors were properly centered and scaled prior to Bayesian modeling, ensuring comparability and numerical stability during estimation.

  2. Prior Distributions

Priors for regression coefficients were drawn from a Normal(0, 2.5) distribution, representing weakly informative assumptions centered at zero with moderate spread. The prior density plots for Age (per 1 SD) and BMI (per 1 SD) demonstrate symmetric bell-shaped distributions, indicating no strong bias toward positive or negative effects before observing data.

Interpretation

Visualizations validate the standardization of continuous predictors (ensuring interpretability per 1 SD increase), and the reasonableness of priors, which are broad enough to allow data-driven inference yet prevent overfitting.
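One way to see that the Normal(0, 2.5) prior is broad on the scale that matters is to push prior draws through exp(): coefficient prior mass maps to a very wide range of odds ratios. A quick sketch:

```r
# Sketch: odds ratios implied by a Normal(0, 2.5) coefficient prior.
set.seed(123)
beta <- rnorm(10000, 0, 2.5)
quantile(exp(beta), c(0.025, 0.5, 0.975))  # median near 1, very wide tails
```

The central 95% interval spans roughly exp(±1.96 × 2.5), i.e., ORs from about 0.007 to 135, so the prior barely constrains effects of the magnitude seen in this data while still ruling out absurdly extreme values.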

Visualization from Posterior Draw for predictive checking in Bayesian statistics

The workflow presented here combines posterior summaries, visualization of distributions, odds ratio interpretation, predictive checking, and convergence diagnostics to validate the Bayesian model.

  1. Posterior Summaries and Distributions - to understand parameter estimates, uncertainty, and credible intervals.
  • summary(bayes_fit) provides posterior means, medians, and 95% credible intervals, plus convergence diagnostics (R-hat, effective sample size).
  • plot(bayes_fit) visualizes posterior distributions: wide distributions indicate high uncertainty; narrow distributions indicate precise estimates.
  2. Posterior Odds Ratios - interpretation of the model coefficients on a multiplicative scale.
  • Posterior summaries for the regression coefficients were extracted, then estimates and credible intervals were transformed to odds ratios (OR = exp(Estimate)) with reference categories NH White (race) and Male (sex).
  • ORs reported with 95% credible intervals (CrI) allow direct interpretation of associations between predictors and outcome.
  3. Posterior Predictive Checks (PPC)
  • pp_check(bayes_fit) assesses how well the model reproduces the observed data and validates model fit.
  • It generates simulated datasets from the posterior and compares them to the observed data.
  • Visualizations include density overlays, histograms, mean/SD comparisons, and bar plots.

There were no large discrepancies (which would indicate potential misfit); the good alignment suggests reliable predictions.

  4. MCMC Convergence - to ensure reliable posterior estimates.
  • mcmc_trace(bayes_fit): trace plots show the chains for each parameter over iterations.
  • Well-mixed chains without trends indicate convergence and stable posterior estimates.
  5. Model Fit - to quantify predictive performance.
  • bayes_R2(bayes_fit): Bayesian R² indicates the proportion of outcome variance explained by the model.
  • R² = 0.13 (13%) → the predictors are relevant, but other factors (e.g., genetics, lifestyle, environment) also contribute to outcome variability.
  6. Correlation and Parameter Relationships (optional)
  • mcmc_pairs(posterior): pairwise plots explore correlations between parameters.
  • mcmc_hist() or mcmc_areas(): histograms or density plots of specific parameters.
  • These detected no strong collinearity or dependencies among predictors.
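The odds-ratio step above amounts to exponentiating the posterior log-odds summaries. A sketch using the age_c numbers from the model summary (note that exp of the posterior mean is close to, but not identical to, an OR summarized directly from the draws, so the tabulated values can differ slightly in the last digit):

```r
# Sketch: convert the age_c posterior summary (log-odds scale) to an OR.
est <- 1.09; l95 <- 0.97; u95 <- 1.22      # values from summary(bayes_fit)
round(exp(c(OR = est, LCL = l95, UCL = u95)), 2)
```

Because exp() is monotone, the credible interval endpoints transform directly, unlike standard errors, which do not carry over to the OR scale.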
Code
library(brms)
summary(bayes_fit)
 Family: bernoulli 
  Links: mu = logit 
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race 
   Data: adult_imp1 (Number of observations: 5592) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -2.66      0.09    -2.84    -2.50 1.00     4187     3510
age_c                   1.09      0.06     0.97     1.22 1.00     3012     3098
bmi_c                   0.63      0.05     0.53     0.72 1.00     3472     3315
sexFemale              -0.66      0.10    -0.86    -0.46 1.00     4003     3052
raceMexicanAmerican     0.69      0.18     0.35     1.04 1.00     3526     2843
raceOtherHispanic       0.43      0.24    -0.07     0.89 1.00     4058     3114
raceNHBlack             0.54      0.15     0.24     0.82 1.00     3597     3177
raceOtherDMulti         0.82      0.19     0.45     1.19 1.00     3763     3257

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
plot(bayes_fit)   # Posterior distributions

Code
pp_check(bayes_fit)      # Posterior predictive checks

Code
mcmc_trace(bayes_fit)    # Convergence (optional)

Code
bayes_R2(bayes_fit)      # Model fit
    Estimate  Est.Error      Q2.5    Q97.5
R2 0.1313342 0.01265055 0.1064607 0.156078

Below are tabulated odds ratios computed from the posterior draws

Code
# Posterior ORs (drop intercept, clean labels)

bayes_or <- posterior_summary(bayes_fit, pars = "^b_") %>%
  as.data.frame() %>%
  tibble::rownames_to_column("raw") %>%
  dplyr::mutate(
    term = gsub("^b_", "", raw),
    term = gsub("race", "race:", term),
    term = gsub("sex",  "sex:",  term),
    term = gsub("OtherDMulti", "Other/Multi", term),
    term = gsub("OtherHispanic", "Other Hispanic", term),
    OR   = exp(Estimate),
    LCL  = exp(Q2.5),
    UCL  = exp(Q97.5)
  ) %>%
  dplyr::select(term, OR, LCL, UCL) %>%
  dplyr::filter(term != "Intercept")

knitr::kable(
  bayes_or %>%
    dplyr::mutate(dplyr::across(c(OR,LCL,UCL), ~round(.x, 2))),
  digits = 2,
  caption = "Bayesian posterior odds ratios (95% CrI) — reference: NH White (race), Male (sex)"
)
Bayesian posterior odds ratios (95% CrI) — reference: NH White (race), Male (sex)
term OR LCL UCL
age_c 2.99 2.64 3.37
bmi_c 1.87 1.71 2.05
sex:Female 0.52 0.42 0.63
race:MexicanAmerican 2.00 1.41 2.84
race:Other Hispanic 1.54 0.93 2.43
race:NHBlack 1.71 1.28 2.27
race:Other/Multi 2.27 1.56 3.28

Below is a compact table comparing odds ratios (ORs) and 95% confidence/credible intervals for BMI and Age across three modeling approaches:

  • Survey-weighted maximum likelihood estimation (MLE)
  • Multiple imputation pooled estimates (mice)
  • Bayesian Logistic Regression
Code
# Results

 #Build compact results table (BMI & Age only) 
library(dplyr); 
library(tidyr); 
library(knitr); 
library(stringr)

# pretty "OR (LCL–UCL)" string

fmt_or <- function(or, lcl, ucl, digits = 2) {
  paste0(
    formatC(or,  format = "f", digits = digits), " (",
    formatC(lcl, format = "f", digits = digits), "–",
    formatC(ucl, format = "f", digits = digits), ")"
  )
}

# guardrails: require these to exist from Modeling
stopifnot(exists("svy_or"), exists("mi_or"), exists("bayes_or"))
for (nm in c("svy_or","mi_or","bayes_or")) {
  if (!all(c("term","OR","LCL","UCL") %in% names(get(nm)))) {
    stop(nm, " must have columns: term, OR, LCL, UCL")
  }
}

svy_tbl   <- svy_or   %>% mutate(Model = "Survey-weighted MLE")
mi_tbl    <- mi_or    %>% mutate(Model = "mice pooled")
bayes_tbl <- bayes_or %>% mutate(Model = "Bayesian")

all_tbl <- bind_rows(svy_tbl, mi_tbl, bayes_tbl) %>%
  mutate(term = case_when(
    str_detect(term, "bmi_c|\\bBMI\\b") ~ "BMI (per 1 SD)",
    str_detect(term, "age_c|\\bAge\\b") ~ "Age (per 1 SD)",
    TRUE ~ term
  )) %>%
  filter(term %in% c("BMI (per 1 SD)", "Age (per 1 SD)")) %>%
  mutate(OR_CI = fmt_or(OR, LCL, UCL, digits = 2)) %>%
  select(Model, term, OR_CI) %>%
  arrange(
    factor(Model, levels = c("Survey-weighted MLE","mice pooled","Bayesian")),
    factor(term,  levels = c("BMI (per 1 SD)","Age (per 1 SD)"))
  )

res_wide <- all_tbl %>%
  pivot_wider(names_from = term, values_from = OR_CI) %>%
  rename(
    `BMI (per 1 SD) OR (95% CI)` = `BMI (per 1 SD)`,
    `Age (per 1 SD) OR (95% CI)` = `Age (per 1 SD)`
  )

kable(
  res_wide,
  align = c("l","c","c"),
  caption = "Odds ratios (per 1 SD) with 95% CIs across models"
)
Odds ratios (per 1 SD) with 95% CIs across models
Model BMI (per 1 SD) OR (95% CI) Age (per 1 SD) OR (95% CI)
Survey-weighted MLE 1.89 (1.65–2.15) 3.03 (2.70–3.40)
mice pooled 1.73 (1.58–1.89) 2.90 (2.60–3.24)
Bayesian 1.87 (1.71–2.05) 2.99 (2.64–3.37)
Code
# Posterior predictive draws

#Posterior predictive checks (binary outcome)
pp_samples <- posterior_predict(bayes_fit, ndraws = 500)  # 500 draws

# Check dimensions
dim(pp_samples)  # rows = draws, cols = observations
[1]  500 5592
Code
# Plot overlay of observed vs predicted counts (duplicate image)
ppc_dens_overlay(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:50, ]) +
  labs(title = "Posterior Predictive Check: Density Overlay") +
  theme_minimal()

Code
ppc_bars(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:50, ])

Code
# PP checks for proportions (useful for binary outcomes): compare the mean and SD
# of the simulated replicates against the observed values

## mean
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "mean") +
  labs(title = "Posterior Predictive Check: Mean of Replicates") +
  theme_minimal()

Code
## sd
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "sd") +
  labs(title = "PPC: Standard Deviation of Replicates") +
  theme_minimal()

Comparative Visualizations

  • Predicted vs observed - to check how well the model’s predictions align with reality, where mean(y_rep) is the average predicted probability of diabetes for each individual across all posterior draws of the parameters, and y is the actual observed diabetes status (0 = non-diabetic, 1 = diabetic).
  • mcmc plot (dens plots) comparing observed and posterior parameter values (estimates) for bmi_c, age_c, sex_female, and by race categories
  • Fitted (Predicted) vs observed for bmi using point and error bars
  • Fitted (Predicted) vs observed for bmi using line plot

Comparative Visualizations for Model Assessment

To evaluate how well the Bayesian model predicts diabetes-related outcomes, we conducted several visual comparisons between observed and posterior-predicted values:

  1. Posterior Parameter Distributions: to assess uncertainty and effect sizes of predictors, we extracted posterior draws using as_draws_df(bayes_fit).
  • Plotted density overlays (mcmc_areas) for key predictors: age, BMI, sex, and race categories: shows density plots with distribution of posterior estimates to visualize uncertainty and parameter magnitude.
  2. Predicted vs Observed Values: to evaluate model fit by comparing predicted outcomes to actual data using fitted(bayes_fit, summary = TRUE).
  • Compared with the observed values for continuous predictors (e.g., BMI and age).
  • Visualization in Scatter plots with point estimates and 95% credible intervals as error bars shows Line of perfect agreement (slope = 1, dashed red line) for reference.
  • Scatter and error bar plots indicate predicted BMI values align with observed BMI across individuals.
  • Good alignment along the 45° line suggests reliable predictions; deviations highlight areas where the model may under- or over-predict.
  • Overall, these visualizations complement posterior summaries and predictive checks, supporting model validation and interpretation.
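The density overlays described in step 1 can be sketched with bayesplot. This is a minimal illustration, assuming bayes_fit and the b_* coefficient names shown in the model summary:

```r
library(bayesplot)
library(posterior)

# Sketch: density plots of posterior coefficient draws with 95% intervals,
# for the key predictors named in the text
draws <- as_draws_df(bayes_fit)
mcmc_areas(
  draws,
  pars = c("b_age_c", "b_bmi_c", "b_sexFemale",
           "b_raceMexicanAmerican", "b_raceNHBlack"),
  prob = 0.95
)
```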

Predicted vs Observed BMI

To evaluate model fit at the individual level, we compared observed BMI values to posterior-predicted BMI estimates from the Bayesian model:

  1. Data Preparation

We combined the observed BMI (bmi) and the predicted posterior estimates (predicted_bmi) into a single dataset with 95% credible intervals (lower_ci, upper_ci) from the posterior draws, then drew a line plot of BMI across observations:

  • Observed BMI: solid line.
  • Predicted BMI: solid line of posterior means.
  • Shaded ribbon: 95% credible interval around predicted values to visualize uncertainty.
  2. Interpretation
  • Close alignment of predicted lines with observed BMI indicates good model fit.
  • Wider ribbons highlight greater posterior uncertainty for individual predictions.
  • Summary statistics for bmi and standardized bmi_c help contextualize the observed range and variability in the sample.
Code
# Combine observed and predicted into one data frame
# (`predicted` is assumed to come from fitted(bayes_fit, summary = TRUE), as above)
plot_data <- adult_imp1 %>%
  mutate(
    predicted_bmi = predicted[, "Estimate"],
    lower_ci = predicted[, "Q2.5"],
    upper_ci = predicted[, "Q97.5"],
    obs_index = 1:nrow(adult_imp1)  # index for x-axis
  )

# Line plot
ggplot(plot_data, aes(x = obs_index)) +
  geom_line(aes(y = bmi, color = "Observed")) +               # observed BMI
  geom_line(aes(y = predicted_bmi, color = "Predicted")) +   # predicted BMI
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci), alpha = 0.2) +  # uncertainty
  labs(x = "Observation", y = "BMI", color = "Legend") +
  theme_minimal()

Code
summary(adult_imp1$bmi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   14.1    24.1    27.7    29.0    32.4    82.9 
Code
summary(plot_data$bmi_c)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.09476 -0.69669 -0.19338 -0.01133  0.46371  7.52398 

Centering for bmi:

  • Summary of the original bmi (observed data) and the centered version of BMI.
  • Centering doesn’t change the distribution shape; it only shifts it so the mean is zero.
  • Centering is useful in regression/Bayesian models to improve numerical stability and the interpretability of intercepts.
  • The plots below show predicted values of bmi and age (prior vs predicted) and the proportion of diabetes = 1 for each draw.
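A minimal sketch of how bmi_c could have been constructed. Since the odds ratios above are reported "per 1 SD", bmi_c is assumed here to be the standardized (not merely centered) version:

```r
library(dplyr)

# Sketch: centering vs standardizing BMI; bmi_c is assumed to be the
# standardized version, matching the "per 1 SD" odds ratios reported above
adult_imp1 <- adult_imp1 %>%
  mutate(
    bmi_centered = bmi - mean(bmi),        # shifts mean to 0, SD unchanged
    bmi_c        = as.numeric(scale(bmi))  # shifts mean to 0 and divides by SD
  )
```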

Visualization on Prior vs Posterior Distributions

  • To assess how the Bayesian model updates beliefs from prior information to posterior estimates, we compared prior vs posterior coefficient distributions for key predictors: BMI and age.
  1. Prior Draws
  • Simulated from a standard normal distribution (mean = 0, SD = 1) for both the BMI and age coefficients; these represent initial beliefs about coefficient values before seeing the data.
  2. Posterior Draws
  • Extracted from the fitted model (bayes_fit) for b_bmi_c and b_age_c.
  • Pivoted to long format and labeled as “Posterior”.
  3. Visualization of Combined Prior and Posterior Draws
  • Plotted density overlays with facets for BMI and age.
  • Posterior distributions are narrower and often shifted from the prior, reflecting information gained from the data.
  • Differences between prior and posterior highlight the model’s learning about effect sizes.
  4. Posterior Predictive Proportions of Diabetes
  • Computed the proportion of diabetes cases (diabetes = 1) for each posterior draw (pp_samples).

Summary statistics:

  • mean(pp_proportion): the average predicted prevalence.
  • Compared to observed prevalence: survey-weighted mean, svymean(~diabetes_dx, nhanes_design_adult); observed sample mean, mean(adult_imp1$diabetes_dx).
  • A histogram of the posterior proportions shows the variability and uncertainty in predicted diabetes prevalence.
  • Alignment of the posterior mean with the observed prevalence indicates good predictive performance.

Interpretation:

  • Prior vs posterior plots demonstrate that the Bayesian model updates prior beliefs in a data-informed way.
  • Posterior predictive proportions closely match the observed prevalence, supporting model reliability for inference and prediction.

Code
prior_summary(bayes_fit)
               prior     class                coef group resp dpar nlpar lb ub
      normal(0, 2.5)         b                                                
      normal(0, 2.5)         b               age_c                            
      normal(0, 2.5)         b               bmi_c                            
      normal(0, 2.5)         b raceMexicanAmerican                            
      normal(0, 2.5)         b         raceNHBlack                            
      normal(0, 2.5)         b     raceOtherDMulti                            
      normal(0, 2.5)         b   raceOtherHispanic                            
      normal(0, 2.5)         b           sexFemale                            
 student_t(3, 0, 10) Intercept                                                
 tag       source
             user
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
             user
Code
# Note: the model's coefficient prior is normal(0, 2.5) (see prior_summary above);
# a standard normal is drawn here for illustration
prior_draws <- tibble(
  term = rep(c("BMI (per 1 SD)", "Age (per 1 SD)"), each = 4000),
  estimate = c(rnorm(4000, 0, 1), rnorm(4000, 0, 1)),
  type = "Prior"
)

post
# A draws_df: 1000 iterations, 4 chains, and 11 variables
   b_Intercept b_age_c b_bmi_c b_sexFemale b_raceMexicanAmerican
1         -2.6     1.1    0.70       -0.71                  0.67
2         -2.7     1.0    0.62       -0.57                  0.65
3         -2.6     1.1    0.65       -0.76                  0.63
4         -2.7     1.0    0.65       -0.67                  0.82
5         -2.6     1.1    0.61       -0.73                  0.75
6         -2.5     1.0    0.60       -0.77                  0.61
7         -2.8     1.1    0.66       -0.66                  0.52
8         -2.8     1.2    0.67       -0.57                  0.94
9         -2.8     1.1    0.65       -0.52                  0.84
10        -2.6     1.1    0.67       -0.85                  0.70
   b_raceOtherHispanic b_raceNHBlack b_raceOtherDMulti
1                0.605          0.52              0.95
2                0.338          0.45              0.69
3                0.566          0.63              0.54
4                0.453          0.61              0.78
5                0.090          0.50              0.62
6                0.015          0.48              0.60
7                0.736          0.50              0.84
8                0.913          0.57              1.07
9                0.570          0.66              0.81
10               0.467          0.54              0.97
# ... with 3990 more draws, and 3 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
Code
head(post)
# A draws_df: 6 iterations, 1 chains, and 11 variables
  b_Intercept b_age_c b_bmi_c b_sexFemale b_raceMexicanAmerican
1        -2.6     1.1    0.70       -0.71                  0.67
2        -2.7     1.0    0.62       -0.57                  0.65
3        -2.6     1.1    0.65       -0.76                  0.63
4        -2.7     1.0    0.65       -0.67                  0.82
5        -2.6     1.1    0.61       -0.73                  0.75
6        -2.5     1.0    0.60       -0.77                  0.61
  b_raceOtherHispanic b_raceNHBlack b_raceOtherDMulti
1               0.605          0.52              0.95
2               0.338          0.45              0.69
3               0.566          0.63              0.54
4               0.453          0.61              0.78
5               0.090          0.50              0.62
6               0.015          0.48              0.60
# ... with 3 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
Code
library(brms)
library(tidyr)

# Extract posterior draws as dataframe

post <- as_draws_df(bayes_fit) %>%           # bayes_fit = your brms model
  select(b_bmi_c, b_age_c) %>%               # select your coefficient columns
  pivot_longer(
    everything(),
    names_to = "term",
    values_to = "estimate"
  ) %>%
  mutate(
    term = case_when(
      term == "b_bmi_c" ~ "BMI (per 1 SD)",
      term == "b_age_c" ~ "Age (per 1 SD)"
    ),
    type = "Posterior"
  )

## visualization of prior and predicted draws
combined_draws <- bind_rows(prior_draws, post) 

library(ggplot2)

ggplot(combined_draws, aes(x = estimate, fill = type)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ term, scales = "free", ncol = 2) +
  theme_minimal(base_size = 13) +
  labs(
    title = "Prior vs Posterior Distributions",
    x = "Coefficient estimate",
    y = "Density",
    fill = ""
  )

Code
# Compute proportion of diabetes=1 for each draw
pp_proportion <- rowMeans(pp_samples)  # proportion of 1's in each posterior draw

# Summary of posterior proportions
summary(pp_proportion)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09478 0.10497 0.10891 0.10941 0.11342 0.13126 
Code
# Optional: visualize the posterior probability distribution
pp_proportion_df <- tibble(proportion = pp_proportion)

ggplot(pp_proportion_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(
    title = "Posterior Distribution of Proportion of Diabetes = 1",
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()

Code
svy_mean <- svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
Comparison of Diabetes Prevalence Across Methods
Method diabetes_mean SE
Survey-weighted mean (NHANES) 0.0890 0.0048
Imputed dataset mean 0.1105 NA
Posterior predictive mean 0.1094 NA

Each value comes from a posterior predictive draw

Comparison of Prior and Predicted Draws for Both Age and BMI

  • The plot shows the posterior distributions are much more concentrated (e.g., around ~0.5) than the priors, indicating the data provided strong evidence about the effect sizes of these covariates.
  • The priors were diffuse, showing initial uncertainty
  • The posteriors are precise, showing learning from the data.

Proportion of diabetes in the posterior draws

  • Represents the predicted probability of diabetes = 1 in the population according to the Bayesian model.
  • Min = 0.095 → in some posterior draws, only ~9.5% of the population is predicted to have diabetes.
  • 1st Quartile = 0.105 → 25% of posterior draws predict diabetes prevalence below 10.5%.
  • Median = 0.109 → half of the simulated draws predict a prevalence below ~10.9%, half above.
  • Mean = 0.109 → the average predicted prevalence is ~10.9%, very close to the median → roughly symmetric distribution.
  • 3rd Quartile = 0.113 → 75% of draws predict prevalence below ~11.3%.
  • Max = 0.131 → the highest predicted prevalence across all draws is ~13.1%.

The Bayesian model predicts that about 10–11% of this population has diabetes, with a relatively narrow range across posterior draws reflecting the uncertainty in the estimate: while most predictions cluster around 10–11%, the model allows for values as low as ~9.5% and as high as ~13.1%. Comparing this with the raw imputed-data proportion shows that the model predictions align with the observed/imputed data.

Clinical relevance:

  • The predicted proportion can be interpreted as the expected prevalence of diabetes in the target population, accounting for uncertainty in the model and imputed data.
  • Policy makers or clinicians could plan interventions accordingly, anticipating that ~1 in 10 adults in this population might have diabetes.
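To make the "~1 in 10 adults" figure concrete for planning, the posterior prevalence draws can be translated into expected case counts. A minimal sketch, where the catchment size of 100,000 adults is a hypothetical assumption:

```r
# Sketch (assumes pp_proportion from the posterior predictive draws above):
# expected diabetes cases in a hypothetical catchment of 100,000 adults
pop_size   <- 1e5
case_draws <- pp_proportion * pop_size
quantile(case_draws, c(0.025, 0.5, 0.975))  # posterior median and 95% CrI
```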

Posterior predicted proportion of Diabetes vs NHANES prevalence of Diabetes in the population

Code
library(tidyverse)

# Posterior predicted proportion vector
# pp_proportion <- rowMeans(pp_samples)  # if not already done

known_prev <- 0.089   # NHANES prevalence

# Posterior summary
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# Create a data frame for plotting
pp_df <- tibble(proportion = pp_proportion)

# Plot
ggplot(pp_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.005, fill = "skyblue", color = "black") +
  geom_vline(xintercept = known_prev, color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = posterior_mean, color = "blue", linetype = "solid", linewidth = 1) +
  geom_rect(aes(xmin = posterior_ci[1], xmax = posterior_ci[2], ymin = 0, ymax = Inf),
            fill = "blue", alpha = 0.1, inherit.aes = FALSE) +
  labs(
    title = "Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    subtitle = paste0("Red dashed = NHANES prevalence (", known_prev, 
                      "), Blue solid = Posterior mean (", round(posterior_mean,3), ")"),
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()

Posterior Predicted Diabetes Proportion vs Observed Prevalence

To evaluate the Bayesian model’s predictive accuracy for diabetes prevalence, we compared the posterior predicted proportion of diabetes cases to the known NHANES prevalence:

  1. Posterior Predictions - Calculated the proportion of diabetes = 1 for each posterior draw (pp_proportion). Derived posterior mean and 95% credible interval to summarize predictive uncertainty.

  2. Visualization - Histogram of posterior predicted proportions illustrates the variability in model predictions.

Red dashed line: NHANES observed prevalence (0.089). Blue solid line: Posterior mean predicted prevalence. Shaded blue region: 95% credible interval around the posterior mean.

Interpretation:

  • Close alignment of the posterior mean and credible interval with the observed NHANES prevalence indicates that the model accurately captures the population-level prevalence of diabetes.
  • This visualization complements the prior-vs-posterior and predicted-vs-observed checks, supporting overall model validity.

Code
library(dplyr)

# Posterior predicted proportion

posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# NHANES prevalence with SE from survey::svymean
# Suppose you already have:
# svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
known_prev <- 0.089        # Mean prevalence
known_se   <- 0.0048       # Standard error from survey

# Calculate 95% confidence interval
known_ci <- c(
  known_prev - 1.96 * known_se,
  known_prev + 1.96 * known_se
)

# Print results
data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(posterior_mean, known_prev),
  Lower_95 = c(posterior_ci[1], known_ci[1]),
  Upper_95 = c(posterior_ci[2], known_ci[2])
)
                     Type      Mean   Lower_95  Upper_95
2.5% Posterior Prediction 0.1094142 0.09861856 0.1213385
        NHANES Prevalence 0.0890000 0.07959200 0.0984080
Code
library(ggplot2)
library(dplyr)

# Create a data frame for plotting
ci_df <- data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(0.1096674, 0.089),
  Lower_95 = c(0.09772443, 0.079592),
  Upper_95 = c(0.1210658, 0.098408)
)

# Plot
ggplot(ci_df, aes(x = Type, y = Mean, color = Type)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = Lower_95, ymax = Upper_95), width = 0.2) +
  ylim(0, max(ci_df$Upper_95) + 0.02) +
  labs(
    title = "Comparison of Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    y = "Proportion of Diabetes",
    x = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

Code
# --- Load libraries ---
library(survey)
library(tibble)
library(ggplot2)

# --- 1. Survey-weighted (Population) prevalence ---
pop_est <- svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
pop_prev <- as.numeric(pop_est)
pop_se <- as.numeric(SE(pop_est))
pop_ci <- c(pop_prev - 1.96 * pop_se, pop_prev + 1.96 * pop_se)

# --- 2. Bayesian posterior prevalence ---
# bayes_pred = matrix of posterior draws (iterations × individuals)
pp_proportion <- rowMeans(pp_samples)             # prevalence per posterior draw
post_prev <- mean(pp_proportion)                  # posterior mean prevalence
post_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# --- 3. Combine into one data frame ---
bar_df <- tibble(
  Source     = c("Survey-weighted (Population)", "Bayesian Posterior"),
  Prevalence = c(pop_prev, post_prev),
  CI_low     = c(pop_ci[1], post_ci[1]),
  CI_high    = c(pop_ci[2], post_ci[2])
)

# --- 4. Plot ---
ggplot(bar_df, aes(x = Source, y = Prevalence, fill = Source)) +
  geom_col(alpha = 0.85, width = 0.6) +
  geom_errorbar(
    aes(ymin = CI_low, ymax = CI_high),
    width = 0.15,
    color = "black",
    linewidth = 0.8
  ) +
  guides(fill = "none") +
  labs(
    title = "Population vs Posterior Diabetes Prevalence",
    subtitle = "Survey-weighted estimate (design-based) vs Bayesian (model-based)",
    y = "Prevalence (Proportion with Diabetes)",
    x = NULL
  ) +
  theme_minimal(base_size = 13)

Comparison of Posterior Predicted Diabetes Prevalence vs Population Estimates

To assess the Bayesian model’s accuracy at the population level, we compared the posterior predicted diabetes prevalence to the survey-weighted NHANES prevalence.

  1. Survey-Weighted (Population) Prevalence
  • Calculated using svymean(~diabetes_dx, nhanes_design_adult).
  • Provided a mean prevalence and 95% confidence interval accounting for the survey design.
  • Mean = 0.089, SE = 0.0048 → 95% CI = 0.080–0.098.
  2. Bayesian Posterior Predicted Prevalence
  • Computed from posterior draws (pp_samples) as the proportion of diabetes cases per draw.
  • Summarized by the posterior mean and 95% credible interval.
  • Posterior mean = 0.110, 95% CrI = 0.098–0.121.
  3. Visualization
  • Bar plot comparing population vs posterior prevalence.
  • Error bars: 95% CI (survey) or 95% credible interval (Bayesian).
  • Alternative plots show points and intervals for direct comparison.

Interpretation:

  • The posterior mean is slightly higher than the survey-weighted prevalence but largely overlaps with the population 95% CI.
  • This indicates that the Bayesian model reliably reproduces population-level diabetes prevalence, supporting both predictive accuracy and model validity.

Survey-Weighted Prevalence (8.9%)

  • Representative estimate of diabetes prevalence in the population, as it adjusts for NHANES’ complex sampling design. It’s slightly lower because survey weights give less influence to overrepresented groups (e.g., those with higher diabetes prevalence).

Imputed (Unweighted) Prevalence (11.1%)

  • Reflects the diabetes proportion from the imputed dataset. Because it is unweighted, it doesn’t correct for sampling bias, so it may overrepresent some subgroups (e.g., older or overweight participants).

Posterior Predictive Mean (10.9%)

  • Bayesian model replicates the imputed data mean, suggesting good model calibration.
  • The model “learned” the sample pattern correctly — no over- or underestimation relative to the data.
  • Because it is close to the imputed value but slightly below it, the posterior distribution appears pulled toward the population-level mean, consistent with Bayesian shrinkage.

Summarizing

Imputed (11.1%) ≈ Posterior Predictive (10.9%) > Survey-Weighted (8.9%). The posterior predictive estimate sits between the two — balancing sample data and uncertainty.

Comparison result:

  • Survey-weighted (NHANES): 0.089 (0.0796, 0.0984)
  • Bayesian posterior: 0.110 (0.098, 0.121)
  • The Bayesian estimate is slightly higher than the survey-based one.
  • This could indicate that the Bayesian model (after adjusting for predictors) predicts more undiagnosed or latent diabetes risk than is captured by self-reports or current screening. It suggests potential under-detection in the survey population, prompting more active screening in certain subgroups.

Implications:

  • Health departments can estimate diabetes burden at the state or county level using Bayesian small-area estimation.
  • Clinicians and public health researchers can plan targeted screening where predicted prevalence is higher than observed.
  • Epidemiologists can validate disease models before applying them to regions without survey data.

MCMC Autocorrelation for Key Parameters

  • To evaluate the independence of posterior samples and ensure reliable Bayesian inference, we examined autocorrelation of MCMC chains for key predictors: age and BMI.
  • Extracted MCMC draws from the fitted model (bayes_fit) as a draws array (as_draws_array).
  • Plotted autocorrelation functions (mcmc_acf) for b_age_c and b_bmi_c.

Interpretation:

  • Autocorrelation plots show how each MCMC sample is related to previous iterations.
  • Low autocorrelation (quick decay to zero) indicates good chain mixing and effectively independent samples.
  • High autocorrelation suggests slower mixing and may require more iterations or tuning.

  1. Outcome
  • Visual inspection of autocorrelation for age and BMI confirms adequate independence of posterior draws, supporting the reliability of parameter estimates and subsequent inference.

  • bayes_fit is the fitted Bayesian model (brms or rstanarm).

  • as_draws_array() converts the posterior samples into a 3-dimensional array (iterations × chains × parameters). For example, if we ran 4 chains with 2000 iterations each and the model has 5 parameters, the array shape would be (2000, 4, 5).

  • mcmc_acf() produces autocorrelation plots for the posterior samples of the specified parameters (b_age_c and b_bmi_c). Autocorrelation measures how correlated a sample is with previous samples in the Markov chain and checks the quality of MCMC sampling:

    • High autocorrelation → the chain moves slowly and is not exploring the posterior efficiently.
    • Low autocorrelation → the chain is mixing well (as in the model).
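The autocorrelation check described above can be sketched as follows, assuming bayes_fit from the earlier modeling step:

```r
library(bayesplot)
library(posterior)

# Sketch: ACF of the posterior draws for the age and BMI coefficients
draws <- as_draws_array(bayes_fit)   # iterations x chains x parameters
mcmc_acf(draws, pars = c("b_age_c", "b_bmi_c"), lags = 20)
```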

Model Overview and Significant Predictors

  1. Multiple Logistic Regression (Survey-weighted MLE)

Significant predictors:

  • Age: strong positive association (p < 0.001)
  • BMI: strong positive association (p < 0.001)
  • Sex (female): negative association (p = 0.0004)
  • Race/Ethnicity: Mexican American (p = 0.0008), Other Hispanic (p = 0.0087), NH Black (p = 0.0117), Other/Multi (p = 0.0014)

  2. Multiple Imputation (MICE)

All predictors remain statistically significant. Positive associations: age, BMI, race categories. Negative association: female sex.

  3. Bayesian Logistic Regression

Sampling via NUTS: 4 chains × 2000 iterations (1000 warmup, 4000 post-warmup draws). Convergence diagnostics:

  • Rhat = 1.00 → excellent convergence
  • Bulk/Tail ESS > 2000 → reliable posterior estimates
  • Posterior R² = 0.13 (95% CrI: 0.106–0.156) → 13% of the variance in diabetes explained by the predictors

  4. Key Predictor Effects

Predictor Effect
Age (per 1 SD) Log-odds of diabetes ↑1.09 per 1 SD increase; strongest positive predictor
BMI (per 1 SD) Higher BMI increases diabetes risk (~1.7–1.9× odds per SD)
Sex (female) Lower odds of diabetes compared to males
Race/Ethnicity Mexican American, NH Black, Other/Multi: significantly higher odds; Other Hispanic: uncertain effect

Posterior density plots illustrate parameter uncertainty. Posterior predictive checks (PPC) simulate new datasets from posterior draws to assess model fit. Combining parameter uncertainty and predictive uncertainty provides credible intervals for predictions (e.g., given BMI, diabetes probability ~40–55%).
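The parenthetical example (a diabetes probability of roughly 40–55% for a given covariate profile) can be reproduced with posterior_epred(). This is a sketch only; the factor level names ("Female", "NHWhite") are assumptions about how the data were coded:

```r
# Sketch (assumes bayes_fit and the centered predictors from earlier):
# posterior distribution of diabetes probability for one hypothetical profile
new_person <- data.frame(
  age_c = 1,            # 1 SD above mean age
  bmi_c = 2,            # 2 SD above mean BMI
  sex   = "Female",     # level name assumed
  race  = "NHWhite"     # reference level; name assumed
)
p_draws <- posterior_epred(bayes_fit, newdata = new_person)
quantile(p_draws, c(0.025, 0.5, 0.975))  # 95% credible interval for the probability
```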

Conclusion

The results suggest our model is reasonable, though it predicts slightly higher risk than the observed population prevalence.

Across multiple modeling approaches—survey-weighted maximum likelihood, multiple imputation, and Bayesian regression—both age and BMI were consistently strong predictors of diabetes. Each standard deviation increase in age nearly tripled the odds of diabetes, while a similar increase in BMI elevated the odds by approximately 1.7–1.9 times. The consistency of these results across models highlights the robustness of the associations and underscores the importance of age and BMI as key risk factors for diabetes in this population.

Effect stability: point estimates from the Bayesian model closely aligned with those from the frequentist models, indicating that prior regularization stabilized the estimates in the presence of modest missingness.

Uncertainty quantification: Bayesian credible intervals for the odds ratios were slightly narrower yet overlapped the frequentist confidence intervals, suggesting comparable inferential precision while offering improved interpretability.

Design considerations:

  • Survey-weighted MLE (maximum likelihood estimation): each observation is weighted according to its survey weight, so estimates reflect population-level parameters rather than just the sample, producing population-representative estimates.
  • Bayesian model with normalized weights: instead of fully modeling the survey design, it used normalized sampling weights as importance weights. Scaling the weights to sum to the sample size approximates the effect of survey weights but does not fully account for stratification, clustering, or design-based variance adjustments; Bayesian inference treats the weighted likelihood as coming from a regular model, ignoring some survey design features.

Discussions

The use of multiple imputation allowed for robust analysis despite missing data, increasing power and reducing bias. Comparison of frequentist and Bayesian models demonstrated consistency in significant predictors, while the Bayesian approach provided the advantage of posterior distributions and probabilistic interpretation. Across all models, both age and BMI emerged as strong and consistent predictors of diabetes, and this consistency across modeling approaches strengthens the validity of the findings. These results align with previous epidemiological research indicating that increasing age and higher BMI are among the most important determinants of type 2 diabetes risk; cumulative exposure to metabolic and lifestyle risk factors over time, and the role of excess adiposity and insulin-related effects, account for much of this risk. Using the survey-weighted dataset strengthened population representativeness, multiple imputation handled missing data, and rigorous Bayesian estimation yielded high effective sample sizes, with R̂ ≈ 1.00 across parameters confirming excellent model convergence. Bayesian logistic regression thus provided inference that was statistically consistent and interpretable, achieving the aim of this study. Future research could fit a hierarchical model across NHANES cycles or add variables (e.g., lab tests) to assess nonlinear effects of metabolic risk factors.

Limitations

  1. The study is based on cross-sectional/observational NHANES data, which limits the ability to make causal inferences. Associations observed between BMI, age, and diabetes status cannot confirm causation.
  2. The project relies on multiple imputation for missing values; although imputation reduces bias, it assumes data are missing at random (MAR). If data are missing not at random (MNAR), results may be biased.
  3. Potential Residual Confounding - Models included key predictors (age, BMI, sex, race), but unmeasured factors like diet, physical activity, socioeconomic status, or genetic predisposition could confound associations.
  4. Bayesian models depend on prior choices, which could influence posterior estimates if priors are informative or mis-specified.
  5. Variable Measurement - BMI is measured at a single time point, which may not reflect long-term exposure or risk.
  6. Self-reported variables - are subjective to recall or reporting bias.
  7. Interactions are not tested in the study (bmi × age) and so other potential synergistic effects might be missed.
  8. Predicted probabilities are model-based estimates, not validated for clinical decision-making. External validation in independent cohorts is needed before application.

QUESTION for targeted therapy

Translational Perspective from the Bayesian Diabetes Prediction Project. This project further demonstrates the translational potential of Bayesian modeling in clinical decision-making and public health strategy. By using patient-level predictors such as age, BMI, sex, and race to estimate the probability of diabetes, the model moves beyond descriptive statistics toward individualized risk prediction. The translational move lies in converting these probabilistic outputs into actionable thresholds—such as identifying the BMI or age at which the predicted risk of diabetes exceeds a clinically meaningful level (e.g., 30%). Such insights can guide early screening, personalized lifestyle interventions, and targeted prevention programs for populations at higher risk. This approach embodies precision public health—bridging data science and medical decision-making to deliver tailored, evidence-based strategies that can ultimately improve diabetes prevention and management outcomes.

What changes in modifiable predictors would lower diabetes risk?

Translational Research Implications

  • We now use the model to guide prevention or intervention.
  • Only BMI is a modifiable risk factor here.
  • What must change in BMI, behavior, or lifestyle to achieve a lower risk threshold? In practice, we hold non-modifiable predictors (sex, race) constant and vary the modifiable predictor (BMI) until the model predicts the desired probability.
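The inverse-prediction idea above can be sketched with a toy logistic risk function. The intercept and slopes below are illustrative placeholders, not estimates from the fitted Bayesian model; the sketch only shows the mechanics of sweeping BMI until a target risk is reached.

```r
# Illustrative inverse prediction: find the BMI at which predicted diabetes
# risk crosses a target threshold, holding other predictors fixed.
# Coefficients are made-up values for demonstration only.
inv_logit <- function(x) 1 / (1 + exp(-x))

risk_at_bmi <- function(bmi, age = 40) {
  # hypothetical linear predictor: intercept + age effect + BMI effect
  inv_logit(-7.0 + 0.04 * age + 0.12 * bmi)
}

bmi_grid <- seq(18, 40, by = 0.1)
risks    <- risk_at_bmi(bmi_grid)
target   <- 0.30

# smallest BMI on the grid whose predicted risk reaches the 30% target
bmi_threshold <- min(bmi_grid[risks >= target])
```

In the real analysis the same sweep would be done through the posterior, replacing `risk_at_bmi` with posterior predicted probabilities from the fitted model.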

Internal validation

  • To illustrate personalized risk estimation using the Bayesian model, we computed the posterior predicted probability of diabetes for a representative participant.
  1. Data and Prediction
  • Selected one participant from the dataset (adult[1, ]) including all relevant covariates (age, BMI, sex, race).
  • Used posterior_linpred with transform = TRUE to obtain predicted probabilities for logistic regression.
  2. Posterior Predictive Distribution
  • Extracted posterior draws for the individual’s predicted probability.
  • Computed 95% credible interval from the posterior draws.
  • Density plot shows the distribution of plausible probabilities given the participant’s covariates.

Interpretation for Targeted Implementation

  • The density highlights uncertainty around the individual’s predicted diabetes risk.
  • The 95% credible interval provides a range of probable outcomes, not just a point estimate.
  • This approach allows personalized risk assessment, enabling clinicians or public health practitioners to:

Clinical implementation

  • Identify high-risk individuals
  • Tailor preventive interventions (e.g., lifestyle modification, monitoring)
  • Quantify uncertainty in predictions for decision-making
  • Posterior predictive distributions enable probabilistic, individualized predictions, supporting targeted intervention strategies beyond population-level summaries.

Code
# Use the first participant 
# using multiple covariates to select someone
participant1_data  <- adult[1, ]


# predicted probabilities for patient 1
phat1 <- posterior_linpred(bayes_fit, newdata = participant1_data, transform = TRUE)
# 'transform = TRUE' gives probabilities for logistic regression

# Store in a data frame for plotting
post_pred_df <- data.frame(pred = phat1)

# Compute 95% credible interval
ci_95_participant1 <- quantile(phat1, c(0.025, 0.975))

# Plot

ggplot(post_pred_df, aes(x = pred)) + 
  geom_density(color='darkblue', fill='lightblue') +
  geom_vline(xintercept = ci_95_participant1[1], color='red', linetype='dashed') +
  geom_vline(xintercept = ci_95_participant1[2], color='red', linetype='dashed') +
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
  theme_bw()

Predicting Diabetes Risk for a New Participant

To demonstrate the application of the Bayesian model for personalized prediction, we applied the trained model to a new participant not included in the original dataset.

  1. Method
  • Selected a new participant with specific covariates (age, BMI, sex, race).
  • Used posterior_linpred with transform = TRUE to compute posterior predicted probabilities of diabetes. Generated posterior draws to capture predictive uncertainty.
  2. Posterior Predictive Distribution
  • Created a density plot of predicted probabilities. Computed 95% credible interval to summarize the range of likely outcomes.
  • Red dashed lines indicate the lower and upper bounds of the interval.

Interpretation for Targeted Implementation

  • The distribution shows not only the most probable risk but also the uncertainty around it.
  • Credible intervals help quantify confidence in individual-level predictions.
  • Supports personalized decision-making, such as targeted lifestyle interventions, early monitoring, or preventive care.
  • Bayesian posterior predictive draws allow probabilistic, individualized predictions for new participants, providing both point estimates and uncertainty measures for actionable risk assessment.

Code
library(ggplot2)


# NOTE: the model was fit on centered covariates (age_c, bmi_c), so values
# here should be deviations from the sample means. Supplying raw values
# (40, 25) places this participant far above the mean, which likely explains
# why all posterior probabilities below collapse to 1.
new_participant <- data.frame(
  age_c = 40,
  bmi_c = 25,
  sex   = "Female",
  race  = "Mexican American"
)

# Posterior predicted probabilities
phat_new <- posterior_linpred(bayes_fit, newdata = new_participant, transform = TRUE)

# Convert to numeric vector
phat_vec <- as.numeric(phat_new)

# Check the range to see if all values are similar
range(phat_vec)
[1] 1 1
Code
# Store in a data frame
post_pred_df_new <- data.frame(pred = phat_vec)

# Compute 95% credible interval from the vector
ci_95_new_participant <- quantile(phat_vec, c(0.025, 0.975))

# Plot
ggplot(post_pred_df_new, aes(x = pred)) + 
  geom_density(color='darkblue', fill='lightblue', alpha = 0.6) +
  geom_vline(xintercept = ci_95_new_participant[1], color='red', linetype='dashed') +
  geom_vline(xintercept = ci_95_new_participant[2], color='red', linetype='dashed') +
  xlim(0, 1) +  # ensures you see the curve even if values are close
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution (95% Credible Interval)') +
  theme_bw()

Targeted BMI Analysis for Predicted Diabetes Risk

This analysis examines the relationship between BMI and the predicted probability of diabetes, holding other covariates (age, sex, race) constant, using the fitted Bayesian logistic regression model.

  • Generated a grid of BMI values (e.g., 18–40 kg/m²) for a specific demographic profile: age = 40, sex = Female, race = Mexican American.
  • Computed posterior predicted probabilities of diabetes for each BMI value, then averaged across posterior draws to obtain the mean predicted probability per BMI.
  • Target probability approach: defined a target probability of diabetes (e.g., 0.3) and identified the BMI value whose predicted probability is closest to the target. This enables inverse prediction, linking statistical inference to clinically meaningful thresholds.
  • Visualization: line plot of predicted probability vs BMI, with a red dashed horizontal line at the target probability (0.3) and a red dotted vertical line at the BMI closest to that target, annotated to highlight the BMI threshold.
  • Interpretation: provides a practical guideline, namely the BMI at which an individual with a given profile reaches a predefined diabetes risk. This supports personalized risk communication and preventive interventions, and translates model output into actionable, clinically relevant thresholds, bridging research findings with public health application. It demonstrates how Bayesian posterior predictions can be used for targeted, individualized risk assessment, informing precision prevention strategies based on modifiable risk factors like BMI.
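The steps above can be sketched as follows. This is a sketch, not the exact analysis code: it assumes the fitted model object `bayes_fit` from earlier sections is in the session, and that `mean_age` and `mean_bmi` (hypothetical names) hold the sample means used to center the covariates.

```r
library(ggplot2)

# Grid of BMI values for a fixed demographic profile (centered covariates)
bmi_grid <- seq(18, 40, by = 0.5)
profile_grid <- data.frame(
  age_c = 40 - mean_age,          # fixed age of 40
  bmi_c = bmi_grid - mean_bmi,    # vary BMI across the grid
  sex   = "Female",
  race  = "Mexican American"
)

# Posterior predicted probabilities: one row per draw, one column per BMI value
phat_grid <- posterior_linpred(bayes_fit, newdata = profile_grid, transform = TRUE)
mean_prob <- colMeans(phat_grid)  # average over posterior draws

# BMI whose mean predicted probability is closest to the 0.3 target
target   <- 0.30
bmi_star <- bmi_grid[which.min(abs(mean_prob - target))]

ggplot(data.frame(bmi = bmi_grid, prob = mean_prob), aes(bmi, prob)) +
  geom_line(color = "darkblue") +
  geom_hline(yintercept = target, color = "red", linetype = "dashed") +
  geom_vline(xintercept = bmi_star, color = "red", linetype = "dotted") +
  annotate("text", x = bmi_star, y = target + 0.05,
           label = paste0("BMI ≈ ", bmi_star)) +
  xlab("BMI (kg/m²)") +
  ylab("Predicted probability of diabetes") +
  ggtitle("Predicted Diabetes Risk vs BMI (target = 0.30)") +
  theme_bw()
```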

Implications

  • Age and BMI are robust and independent predictors of diabetes, underscoring the importance of early, targeted interventions in mitigating diabetes risk.
  • Longitudinal studies, and combining other statistical methods with Bayesian analysis, could further enhance and better inform precision prevention strategies.

References

Austin, Peter C., Ian R. White, Douglas S. Lee, and Stef van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 2013. “National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010.” Vital and Health Statistics, Series 2, Number 161. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.” Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A tutorial on Bayesian Normal linear regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Leeuw, Christiaan de, and Irene Klugkist. 2012. “Augmenting Data With Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Yi Ming, Sam Li Sheng Chen, Amy Ming Fang Yen, and Hsiu Hsi Chen. 2013. “Individual risk prediction model for incident cardiovascular disease: A Bayesian clinical reasoning approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/J.IJCARD.2012.05.016.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian statistics and modelling.” Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Schoot, Rens van de, David Kaplan, Jaap Denissen, Jens B. Asendorpf, and Marcel A. G. van Aken. 2013. “A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research.” Child Development. https://doi.org/10.1111/cdev.12169.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Vol. 10. Boca Raton, FL: CRC Press.
Zeger, Scott L, Zhenke Wu, Yates Coley, Anthony Todd Fojo, Bal Carter, Katherine O’Brien, Peter Zandi, et al. 2020. “Using a Bayesian Approach to Predict Patients’ Health and Response to Treatment,” no. 2020. http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=medp&NEWS=N&AN=37708307.